In this section, we will work with the cleaned and merged bike station and weather data derived from Part I. We will explore the dataset to extract insights that help answer the questions posed by the City of Toronto, and hopefully also surface valuable insights beyond the original scope of work.
The list below shows the questions that were posed by the City of Toronto. We aim to answer the following questions through the analyses presented in this notebook:
When are people using the bike share system? How does usage vary across the year, the week, and the day?
Is there a difference in usage behaviour between Casual and Annual Member riders?
Is there an increasing trend of usage from 2017 to 2020 and is the trend the same for both Casual and Annual Member riders?
How popular is FREE RIDE WEDNESDAYS?
How did usage change in 2020 due to the pandemic and government-mandated lockdowns?
How do statutory holidays impact demand?
Which neighbourhoods have seen the largest number of rides depart from bike stations located within their boundaries?
Which neighbourhoods have seen the largest number of rides end at bike stations located within their boundaries?
How does the weather change the way people use the bike share system?
What weather features are most influential (temperature, humidity, precipitation, etc.)?
We have also conducted a series of analyses to provide some of our own insights into the dataset. We have attempted to answer the following questions through our analyses:
Does bike-share usage vary depending on proximity to TTC subway and streetcar stations?
If everyone is travelling along bike paths, which bike paths are the most congested and at what times of the day?
Are there seasonal trends in trip duration?
# Import 3rd party libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import pytz
import chardet
import string
import datetime
import geopandas as gpd
import folium
from folium import Choropleth
from folium import Marker
import holidays
import osmnx as ox
from folium import GeoJson
from folium import Popup
import networkx as nx
from matplotlib import ticker
from shapely.geometry import LineString, Point
from shapely import geometry, ops
from numpy import median
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter
from matplotlib.dates import MonthLocator, DateFormatter
# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')
# Centre all the charts displayed in this notebook
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
We will import the cleaned bike trip dataset that will be used for analysis.
#Import cleaned bike trip data
df_trips_data=pd.read_csv('df_merged_data.csv')
df_trips_data.set_index(keys='trip_id', drop=True, inplace=True)
df_trips_data.head()
| subscription_id | trip_duration | start_station_id | start_time | start_station_name | end_station_id | end_time | end_station_name | bike_id | user_type | ... | temp_c | dew_point_temp_c | rel_hum_ | wind_dir_10s_deg | wind_spd_kmh | visibility_km | stn_press_kpa | hmdx | wind_chill | weather | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trip_id | |||||||||||||||||||||
| 712441 | NaN | 274 | 7006.0 | 2017-01-01 00:03:00-05:00 | Bay St / College St (East Side) | 7021.0 | 2017-01-01 00:08:00-05:00 | Bay St / Albert St | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712442 | NaN | 538 | 7046.0 | 2017-01-01 00:03:00-05:00 | Niagara St / Richmond St W | 7147.0 | 2017-01-01 00:12:00-05:00 | King St W / Fraser Ave | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712443 | NaN | 992 | 7048.0 | 2017-01-01 00:05:00-05:00 | Front St / Yonge St (Hockey Hall of Fame) | 7089.0 | 2017-01-01 00:22:00-05:00 | Church St / Wood St | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712444 | NaN | 1005 | 7177.0 | 2017-01-01 00:09:00-05:00 | East Liberty St / Pirandello St | 7202.0 | 2017-01-01 00:26:00-05:00 | Queen St W / York St (City Hall) | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712445 | NaN | 645 | 7203.0 | 2017-01-01 00:14:00-05:00 | Bathurst St / Queens Quay W | 7010.0 | 2017-01-01 00:25:00-05:00 | King St W / Spadina Ave | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
5 rows × 31 columns
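As an aside, the datetime columns could be parsed at load time instead of being converted afterwards. A minimal sketch, using a synthetic two-row sample that mimics the layout of df_merged_data.csv (the real file is not loaded here):

```python
import io
import pandas as pd

# Synthetic sample mimicking df_merged_data.csv (illustrative only)
csv_text = (
    "trip_id,start_time,end_time,start_station_id\n"
    "712441,2017-01-01 00:03:00-05:00,2017-01-01 00:08:00-05:00,7006\n"
    "712442,2017-01-01 00:03:00-05:00,2017-01-01 00:12:00-05:00,7046\n"
)

# parse_dates converts the timestamp columns while reading,
# and index_col replaces the later set_index call
df = pd.read_csv(io.StringIO(csv_text),
                 parse_dates=['start_time', 'end_time'],
                 index_col='trip_id')
```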
We will keep a backup of the trip data in a separate variable, in case we need to reset the applied manipulations.
#Backup of the dataframe
df_trips_data2=df_trips_data.copy()
Before diving into the analysis, it is important to understand and define the following properties about our dataset.
Structure - What is the format of our data file?
Scope - How complete is our data set?
Granularity - How fine or coarse is each row and column?
Some of these questions were already explored in Part I (Data Cleaning and Wrangling), but we would like to revisit them in more detail in this section.
The structure of the datafile can be examined by looking at the "shape". We can see that there are 8,007,423 rows and 31 columns. This means we have over 8 million trip records in the database from January 2017 to November 2020.
df_trips_data.shape
(8007423, 31)
What are the fields (e.g. columns) in each record? What is the type of each column? We can use .columns to examine this. We can see that there is data about each trip along with the associated information about the weather when the trip was taken. We joined the bike trip data and the weather dataframe together in Part I.
There are some redundant columns in the database, for example 'start_station_name' contains the same information as 'start_station_name_npl' except in the latter, the special characters have been removed. This is a relic column from the data wrangling process, but it has been left in here in case we prefer to access the station names without the special characters again.
Otherwise, it can be seen that there is information about the origin and destination of a trip (bike station name and id), their location coordinates, as well as information about the weather collected at the City of Toronto weather station at the time the trip was taken.
df_trips_data.columns
Index(['subscription_id', 'trip_duration', 'start_station_id', 'start_time',
'start_station_name', 'end_station_id', 'end_time', 'end_station_name',
'bike_id', 'user_type', 'start_station_name_npl',
'end_station_name_npl', 'start_station_lat', 'start_station_lon',
'end_station_lat', 'end_station_lon', 'merge_time', 'year', 'month',
'day', 'time', 'temp_c', 'dew_point_temp_c', 'rel_hum_',
'wind_dir_10s_deg', 'wind_spd_kmh', 'visibility_km', 'stn_press_kpa',
'hmdx', 'wind_chill', 'weather'],
dtype='object')
To summarize, the data cleaning conducted in Part I removed a number of problematic records from the dataset (see Part I for details).
Below shows the number and percentage of cells with null values (missing data). All trip records contain information about their start and end stations and the respective location coordinates. There are trips missing 'bike_id' and 'subscription_id', but this is because this information was not available in the 2017 and 2018 datasets. Some trip records are also missing information about the weather.
#Percentage of Data Containing Null Record
trips_data_missing = pd.DataFrame(df_trips_data.isnull().sum())
trips_data_missing = trips_data_missing.rename(columns={0:"count"})
trips_data_missing['percent_nulldata']=round(trips_data_missing['count']/df_trips_data.shape[0] * 100,1)
trips_data_missing
| count | percent_nulldata | |
|---|---|---|
| subscription_id | 3199754 | 40.0 |
| trip_duration | 0 | 0.0 |
| start_station_id | 0 | 0.0 |
| start_time | 0 | 0.0 |
| start_station_name | 0 | 0.0 |
| end_station_id | 0 | 0.0 |
| end_time | 0 | 0.0 |
| end_station_name | 0 | 0.0 |
| bike_id | 3199754 | 40.0 |
| user_type | 0 | 0.0 |
| start_station_name_npl | 0 | 0.0 |
| end_station_name_npl | 0 | 0.0 |
| start_station_lat | 0 | 0.0 |
| start_station_lon | 0 | 0.0 |
| end_station_lat | 0 | 0.0 |
| end_station_lon | 0 | 0.0 |
| merge_time | 0 | 0.0 |
| year | 0 | 0.0 |
| month | 0 | 0.0 |
| day | 0 | 0.0 |
| time | 0 | 0.0 |
| temp_c | 48475 | 0.6 |
| dew_point_temp_c | 59007 | 0.7 |
| rel_hum_ | 57615 | 0.7 |
| wind_dir_10s_deg | 14177 | 0.2 |
| wind_spd_kmh | 14177 | 0.2 |
| visibility_km | 0 | 0.0 |
| stn_press_kpa | 50633 | 0.6 |
| hmdx | 5151443 | 64.3 |
| wind_chill | 7578399 | 94.6 |
| weather | 0 | 0.0 |
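The same null summary can be produced a bit more compactly with `isna().mean()`, which gives the fraction of nulls per column directly. A sketch on a small synthetic frame:

```python
import numpy as np
import pandas as pd

# Toy frame with some missing weather values (illustrative only)
df = pd.DataFrame({
    'temp_c': [1.5, np.nan, 3.0, np.nan],
    'weather': ['clear_day', 'rain', 'clear_day', 'snow'],
})

# isna().sum() counts nulls; isna().mean() is the null fraction per column
null_summary = pd.DataFrame({
    'count': df.isna().sum(),
    'percent_nulldata': (df.isna().mean() * 100).round(1),
})
```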
We use '.info()' to see the datatype of each column in the df_trips_data dataframe.
We can see that 'start_time' and 'end_time' need to be converted back into timezone-aware datetime objects localized to the Eastern timezone.
We can see that 'start_station_id' and 'end_station_id' can be converted into integers.
df_trips_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8007423 entries, 712441 to 10293875
Data columns (total 31 columns):
 #   Column                  Dtype
---  ------                  -----
 0   subscription_id         float64
 1   trip_duration           int64
 2   start_station_id        float64
 3   start_time              object
 4   start_station_name      object
 5   end_station_id          float64
 6   end_time                object
 7   end_station_name        object
 8   bike_id                 float64
 9   user_type               object
 10  start_station_name_npl  object
 11  end_station_name_npl    object
 12  start_station_lat       float64
 13  start_station_lon       float64
 14  end_station_lat         float64
 15  end_station_lon         float64
 16  merge_time              object
 17  year                    int64
 18  month                   int64
 19  day                     int64
 20  time                    object
 21  temp_c                  float64
 22  dew_point_temp_c        float64
 23  rel_hum_                float64
 24  wind_dir_10s_deg        float64
 25  wind_spd_kmh            float64
 26  visibility_km           float64
 27  stn_press_kpa           float64
 28  hmdx                    float64
 29  wind_chill              float64
 30  weather                 object
dtypes: float64(17), int64(4), object(10)
memory usage: 1.9+ GB
#Convert start and end times into tz-aware datetime objects in the Eastern timezone
df_trips_data['start_time']=pd.to_datetime(df_trips_data['start_time'],infer_datetime_format=True)
df_trips_data['end_time']=pd.to_datetime(df_trips_data['end_time'],infer_datetime_format=True)
df_trips_data['merge_time']=pd.to_datetime(df_trips_data['merge_time'],infer_datetime_format=True)
df_trips_data['start_time']=df_trips_data['start_time'].dt.tz_convert(tz='US/Eastern')
df_trips_data['end_time']=df_trips_data['end_time'].dt.tz_convert(tz='US/Eastern')
df_trips_data['merge_time']=df_trips_data['merge_time'].dt.tz_convert(tz='US/Eastern')
#Verify that the datetime change is effected
print(df_trips_data['start_time'].dtype)
print(df_trips_data['end_time'].dtype)
print(df_trips_data['merge_time'].dtype)
df_trips_data.head()
datetime64[ns, US/Eastern] datetime64[ns, US/Eastern] datetime64[ns, US/Eastern]
| subscription_id | trip_duration | start_station_id | start_time | start_station_name | end_station_id | end_time | end_station_name | bike_id | user_type | ... | temp_c | dew_point_temp_c | rel_hum_ | wind_dir_10s_deg | wind_spd_kmh | visibility_km | stn_press_kpa | hmdx | wind_chill | weather | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trip_id | |||||||||||||||||||||
| 712441 | NaN | 274 | 7006.0 | 2017-01-01 00:03:00-05:00 | Bay St / College St (East Side) | 7021.0 | 2017-01-01 00:08:00-05:00 | Bay St / Albert St | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712442 | NaN | 538 | 7046.0 | 2017-01-01 00:03:00-05:00 | Niagara St / Richmond St W | 7147.0 | 2017-01-01 00:12:00-05:00 | King St W / Fraser Ave | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712443 | NaN | 992 | 7048.0 | 2017-01-01 00:05:00-05:00 | Front St / Yonge St (Hockey Hall of Fame) | 7089.0 | 2017-01-01 00:22:00-05:00 | Church St / Wood St | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712444 | NaN | 1005 | 7177.0 | 2017-01-01 00:09:00-05:00 | East Liberty St / Pirandello St | 7202.0 | 2017-01-01 00:26:00-05:00 | Queen St W / York St (City Hall) | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
| 712445 | NaN | 645 | 7203.0 | 2017-01-01 00:14:00-05:00 | Bathurst St / Queens Quay W | 7010.0 | 2017-01-01 00:25:00-05:00 | King St W / Spadina Ave | NaN | annual member | ... | 1.5 | -3.6 | 69.0 | 26.0 | 39.0 | 16.1 | 99.81 | NaN | NaN | clear_day |
5 rows × 31 columns
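The conversion above relies on the parsed timestamps already carrying their UTC offset, so `tz_convert` (not `tz_localize`) is what maps them onto the US/Eastern zone, which also handles the EST/EDT switch automatically. A minimal self-contained sketch:

```python
import pandas as pd

# Offset-aware strings; utc=True parses them into a single aware series
s = pd.to_datetime(pd.Series(['2017-01-01 00:03:00-05:00',
                              '2017-07-01 12:00:00-04:00']), utc=True)

# tz_convert re-expresses the aware timestamps in Eastern time
s_eastern = s.dt.tz_convert('US/Eastern')
```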
#convert the station ids into integers
df_trips_data['end_station_id']=df_trips_data['end_station_id'].astype(int)
df_trips_data['start_station_id']=df_trips_data['start_station_id'].astype(int)
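`astype(int)` works here because the station id columns contain no nulls. A column that still contains NaN (such as bike_id) would raise an error with plain `astype(int)`; pandas' nullable Int64 dtype is the usual workaround. A sketch:

```python
import numpy as np
import pandas as pd

# A float id column with a missing value (illustrative only)
ids = pd.Series([7006.0, np.nan, 7046.0])

# astype(int) would fail on NaN; the nullable Int64 dtype keeps the gap as <NA>
int_ids = ids.astype('Int64')
```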
Granularity refers to how fine or coarse each datum is. The raw data used in our exploratory data analysis, contained in the 'df_trips_data' dataframe, holds one trip per row. Throughout the analysis, we will use '.groupby().agg()' to coarsen the granularity and obtain, for example, daily or hourly ride counts.
Below we use the .groupby().agg() function to determine the daily ride count by date. We also break the trip dataset down by membership type to determine whether the observed trend is the same for both casual and annual member riders.
#Use the start_station_id as proxy for estimating total number of trips on each date because we know every trip
#has a start and end station id
# Determine the daily trip count for every date by using the groupby function, this is for all membership type
df_usage_perd_bymem = df_trips_data.groupby(pd.Grouper(key="start_time",
freq='D')).agg(trip_count=('start_station_id',"count"),
annual_members=('user_type',
lambda x: (x == 'annual member').sum()),
casual_members=('user_type',
lambda x: (x == 'casual member').sum()))
#Create new column field extracting year, day of year, day, month,weekday from Datetime index
df_usage_perd_bymem['year']=df_usage_perd_bymem.index.year
df_usage_perd_bymem['dayofyear']=df_usage_perd_bymem.index.dayofyear
df_usage_perd_bymem['day']=df_usage_perd_bymem.index.day
df_usage_perd_bymem['month']=df_usage_perd_bymem.index.month
df_usage_perd_bymem['dayofweek'] = df_usage_perd_bymem.index.dayofweek
df_usage_perd_bymem['isweek_day'] = np.where(df_usage_perd_bymem.index.weekday < 5, 'weekday', 'weekend')
df_usage_perd_bymem.head()
| trip_count | annual_members | casual_members | year | dayofyear | day | month | dayofweek | isweek_day | |
|---|---|---|---|---|---|---|---|---|---|
| start_time | |||||||||
| 2017-01-01 00:00:00-05:00 | 482 | 412 | 70 | 2017 | 1 | 1 | 1 | 6 | weekend |
| 2017-01-02 00:00:00-05:00 | 826 | 756 | 70 | 2017 | 2 | 2 | 1 | 0 | weekday |
| 2017-01-03 00:00:00-05:00 | 871 | 853 | 18 | 2017 | 3 | 3 | 1 | 1 | weekday |
| 2017-01-04 00:00:00-05:00 | 1395 | 1361 | 34 | 2017 | 4 | 4 | 1 | 2 | weekday |
| 2017-01-05 00:00:00-05:00 | 1210 | 1191 | 19 | 2017 | 5 | 5 | 1 | 3 | weekday |
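The `pd.Grouper`/named-aggregation pattern above generalizes to any frequency. A self-contained sketch with a handful of synthetic trips (the column names follow df_trips_data, but the rows are made up):

```python
import pandas as pd

# Three synthetic trips across two days (illustrative only)
trips = pd.DataFrame({
    'start_time': pd.to_datetime(['2017-01-01 00:03', '2017-01-01 09:15',
                                  '2017-01-02 08:00']),
    'start_station_id': [7006, 7046, 7048],
    'user_type': ['annual member', 'casual member', 'annual member'],
})

# Grouper buckets rows by calendar day; named aggregation counts
# total trips and trips per membership type in one pass
daily = trips.groupby(pd.Grouper(key='start_time', freq='D')).agg(
    trip_count=('start_station_id', 'count'),
    annual_members=('user_type', lambda x: (x == 'annual member').sum()),
    casual_members=('user_type', lambda x: (x == 'casual member').sum()),
)
```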
As part of this question, we will determine if there is a pattern in the bike share usage over the year, week and the day.
Figure 1 shows the total ride counts by year in the df_trips_data database. There is an increasing number of rides every year. The increase corresponds to 35% from 2017 to 2018, 27% from 2018 to 2019 and 6% from 2019 to 2020. The percentage increase may be slightly underestimated for 2020 as it does not include the ride counts from November and December 2020, but the increase was probably still not as significant as in previous years.
Then, we will look at the yearly trend by plotting the daily trip count found in df_usage_perd_bymem against dayofyear. The dayofyear column gives the day of the year on which each date falls.
In this plot titled Figure 2A - Daily Ride Counts by Day of Year between 2017 to 2020, we see a cyclic peak between Day 180 and 270, which corresponds roughly to late June through late September. However, there is quite a bit of day-to-day fluctuation that makes the annual pattern hard to decipher.
In the subsequent plot, Figure 2B - Average Daily Ride Counts (with 95% Confidence Interval) by Month Between 2017 and 2020, we plot the average daily ride count per month between 2017 and 2020. We can see a distinct peak in the summer around July and August, while usage tends to be lowest in the winter months between December and March.
#Set Up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#plot the graph for daily trip count for all, casual and annual members
ax = sns.countplot(data=df_trips_data,x='year', )
ax.axes.set_title("Figure 1 - Total Ride Counts by Year")
ax.set_xlabel("Year")
ax.set_ylabel("Ride Count")
plt.show()
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Use lineplot to plot the daily ride count vs day of year for each year (2017 - 2020)
ax=sns.lineplot(data=df_usage_perd_bymem, x="dayofyear", y="trip_count", hue="year", err_style = 'band', ci=95)
ax.axes.set_title("Figure 2A - Daily Ride Counts by Day of Year between 2017 to 2020",
fontsize=16)
ax.xaxis.set_major_locator(ticker.MultipleLocator(30))
ax.set_ylabel("Daily Ride Counts")
ax.set_xlabel("Day of Year")
ax.legend(title="Year")
plt.show()
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Use lineplot to plot the average daily ride count by month for all users
ax=sns.lineplot(data=df_usage_perd_bymem, x="month", y="trip_count", hue="year", err_style = 'band', ci=95)
ax.axes.set_title("Figure 2B - Average Daily Ride Counts (with 95% Confidence interval) by Month Between 2017 and 2020",
fontsize=16)
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_ylabel("Average Daily Rides",
fontsize=16)
ax.set_xlabel("Month",
fontsize=16)
ax.legend(title="Year",
fontsize=14)
plt.show()
Now, we will examine the weekly trend by plotting the daily trip count found in df_usage_perd_bymem against dayofweek. The dayofweek column gives the day of the week on which each date falls.
In the plot titled Figure 3A - Average Daily Ride Counts by Day of Week, we see that Wednesday appears to be the most popular day in terms of average daily ride count, while Sunday is the least popular. A similar pattern is observed in Figure 3B - Total Ride Counts by Day of Week, which shows the total ride count by day of the week.
In the subsequent plot, Figure 4 - Average Daily Ride Counts by Day of Week by Membership Type, we plot the average daily ride count by day of week and membership type. We start to see a different trend emerge for each membership type. Annual members tend to use the bikes more on weekdays, while casual members tend to use the bikes more on weekends. However, Wednesday appears to be the most popular weekday for both member types. Because significantly more rides are taken by annual members (rides by annual members make up 77% of the total trips included in df_trips_data), the overall trend observed for all member types closely follows the trend observed for annual members.
The differences between the membership types will be revisited again and examined in more detail as part of Question 2 found in this notebook.
#Map the numeric day of week values into string
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df_usage_perd_bymem['weekday'] = df_usage_perd_bymem.index.dayofweek.map(dayOfWeek)
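A slightly more idiomatic alternative to the explicit mapping dictionary is the built-in `day_name()` accessor on a DatetimeIndex, which returns weekday names directly. A sketch:

```python
import pandas as pd

# Three consecutive days starting on a Monday (2017-01-02)
idx = pd.date_range('2017-01-02', periods=3, freq='D')

# day_name() maps each date to its weekday name without a manual dict
names = idx.day_name()
```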
#Set Up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#plot the graph for daily trip count for all, casual and annual members
ax = sns.barplot(x=df_usage_perd_bymem.weekday,y=df_usage_perd_bymem.trip_count,
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 3A - Average Daily Ride Counts (with 95% CI) by Day of Week")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Average Daily Rides")
plt.show()
#Map the numeric day of week values into string
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df_trips_data['weekday'] = df_trips_data.start_time.dt.dayofweek.map(dayOfWeek)
#Set Up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#plot the graph for daily trip count for all, casual and annual members
ax = sns.countplot(data=df_trips_data,x='weekday', order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 3B - Total Ride Counts by Day of Week")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Ride Count")
plt.show()
#Use melt to produce an intermediate "long-form" dataframe for plotting
tidy = pd.melt(df_usage_perd_bymem, id_vars=['weekday'], value_vars=['trip_count','casual_members','annual_members'], var_name='member_type')
tidy.replace({'trip_count':'all_members'}, inplace=True)
plt.figure(figsize=(12,8))
sns.set(font_scale=1.2)
#plot the graph for daily trip count for all, casual and annual members
ax = sns.barplot(data=tidy, x="weekday",y="value", hue="member_type",
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 4 - Average Daily Ride Counts (with 95% CI) by Day of Week and Membership Type")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Average Daily Rides")
plt.show()
#Delete dummy variable
del tidy
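`pd.melt` reshapes the wide per-day counts into the long form seaborn expects for hue grouping: one row per (weekday, member_type) pair. A minimal sketch on synthetic counts:

```python
import pandas as pd

# Wide frame: one column per member type (illustrative values)
wide = pd.DataFrame({
    'weekday': ['Monday', 'Tuesday'],
    'casual_members': [70, 18],
    'annual_members': [412, 853],
})

# melt stacks the value columns, labelling each row with its source column
tidy = pd.melt(wide, id_vars=['weekday'],
               value_vars=['casual_members', 'annual_members'],
               var_name='member_type')
```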
Now, we will examine the daily bike usage by looking at the distribution of bike rides by the hour of the day.
In this plot titled Figure 5 - Bike Usage Throughout Day, we are looking at the probability density function of bike trip counts throughout the day. We can see two peaks: one in the morning between 8:30 and 11 AM, then a much bigger peak between 5 PM and 8 PM. There is evidence to suggest that many bike users utilize the bike share program for commuting purposes; for this reason, we see major peaks during the morning and evening rush hours. We believe the evening peak is much bigger than the morning peak because most recreational rides also occur in the afternoon and evening. This will become more evident in the subsequent analysis when we examine the differences between annual and casual members.
#Calculate the fractional hour of day for each trip based on the start time
#(vectorized dt accessors avoid a slow Python loop over 8 million rows)
df_trips_data['hour'] = (df_trips_data['start_time'].dt.hour
                         + df_trips_data['start_time'].dt.minute/60
                         + df_trips_data['start_time'].dt.second/3600)
#Set up plot
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
ax2.set_xlim(0,24)
sns.set(font_scale=1.2)
#Plot the distribution of ride counts based on time of day
sns.distplot(df_trips_data['hour'], hist = False)
#Format titles
ax2.set_title("Figure 5 - Bike Usage Throughout Day")
ax2.set_xlabel("Hour of the Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
plt.show()
In this question, we will look at the trends that we began to identify in Question 1 and determine whether they hold for both types of members: casual and annual. Annual members pay for an annual pass, which gives them unlimited 30-minute rides for the year. Casual members, on the other hand, pay per ride.
As part of this question, we explored the following questions:
1) How does the daily ride count differ between casual and annual riders?
2) How does the bike usage differ over the week for casual and annual riders?
3) Is there a change in usage whether it is a weekday or weekend for casual and annual members?
4) Is there a change in usage by time of day for casual and annual members?
5) Is there a difference in trip duration for casual and annual members?
We believe that annual members obtained their membership because they use the bike share on a regular basis. For this reason, they would take bike trips mainly for commuting, in addition to some recreational trips after work hours or on weekends. Casual members, on the other hand, do not rely on the bikes to get from Point A to Point B on a regular basis, so their rides are less commute-oriented. Because the intentions of annual and casual members differ, we anticipate that usage will differ by membership type.
Based on the calculation below, trips taken by annual members make up the bulk of the trips contained in 'df_trips_data'.
print("Percentage of trips taken by annual members:" , round(df_trips_data[df_trips_data['user_type']=="annual member"].shape[0]/df_trips_data.shape[0]*100,2))
print("Percentage of trips taken by casual members:" , round(df_trips_data[df_trips_data['user_type']=="casual member"].shape[0]/df_trips_data.shape[0]*100,2))
Percentage of trips taken by annual members: 76.68 Percentage of trips taken by casual members: 23.32
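The same split can be read off directly with `value_counts(normalize=True)`, which returns fractions instead of raw counts. A sketch on a toy user_type column:

```python
import pandas as pd

# Toy membership column: three annual trips, one casual (illustrative only)
user_type = pd.Series(['annual member'] * 3 + ['casual member'])

# normalize=True yields the share of each category; scale to percent
shares = (user_type.value_counts(normalize=True) * 100).round(2)
```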
Figure 6 - Distribution of Daily Ride Counts for Casual Members and Annual Members from 2017 to 2020 shows the probability density distribution of daily ride counts for annual and casual members. Overall, as expected, annual members tend to take more bike trips on any given day than casual riders.
#Set up plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Plot histogram and kde for distribution of daily ride counts for casual and annual members
ride_count=sns.distplot(df_usage_perd_bymem['casual_members'],
label ="Casual")
sns.distplot(df_usage_perd_bymem['annual_members'],
label="Annual")
ride_count.axes.set_title("Figure 6 - Distribution of Daily Ride Counts for Casual Members and Annual Members from 2017 to 2020",
fontsize=16)
ride_count.set_xlabel("Daily Trip Count")
ride_count.set_xlim(0, 12000)
ride_count.set_ylabel("Probability Density")
ride_count.legend()
plt.show()
The difference in bike usage over the week between casual and annual riders is determined by looking at the total number of rides by day of week and membership type. As shown in Figure 7 - Total Ride Counts by Day of Week and Membership Type, annual riders tend to use the bikes more on weekdays, while casual riders prefer to ride on weekends. The results shown here are consistent with what was observed in Figure 4.
#Set Up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#plot the graph for daily trip count for all, casual and annual members
ax = sns.countplot(data=df_trips_data,x='weekday', hue='user_type', order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 7 - Total Ride Counts by Day of Week and Membership Type")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Ride Count")
ax.legend(title="Member Type")
plt.show()
The scatter plot below, Figure 8 - Comparison of Daily Ride Counts for Casual Members and Annual Members on Weekday or Weekend, compares the daily ride counts of casual and annual riders on any given date. On weekdays, the proportion of bike rides taken by annual members increases relative to casual riders; on weekends, it decreases. This trend was also observed in the barplot shown in Figure 4.
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Scatter plot looking and distribution of daily ride counts by membership type and day of week
ride_scatter=sns.scatterplot(x=df_usage_perd_bymem[df_usage_perd_bymem["isweek_day"]=='weekday']['casual_members'],
y=df_usage_perd_bymem[df_usage_perd_bymem["isweek_day"]=='weekday']['annual_members'],
label="Weekday")
sns.scatterplot(x=df_usage_perd_bymem[df_usage_perd_bymem["isweek_day"]=='weekend']['casual_members'],
y=df_usage_perd_bymem[df_usage_perd_bymem["isweek_day"]=='weekend']['annual_members'],
label="Weekend")
ride_scatter.axes.set_title("Figure 8 - Comparison of Daily Ride Counts for Casual Members and Annual Members \non Weekday or Weekend",
fontsize=16)
ride_scatter.set_xlabel("Casual Membership",
fontsize=16)
ride_scatter.set_ylabel("Annual Membership")
ride_scatter.legend()
plt.show()
Considering the behaviour over weekdays and weekends, two distinct trends emerge, one for annual members and one for casual members, as shown in the figure below titled Figure 9 - Average Daily Ride Counts for Weekday or Weekend By Membership Type and Year. Annual members show higher usage during the week than on weekends, while casual members display the inverse trend, with higher usage on weekends than on weekdays.
It is interesting to note that the gap between the weekday and weekend usage decreases in 2020 for the annual riders. We think this is partially due to the impact of COVID-19, which caused a lot of people to work from home. This eliminated the need for people to commute on a regular basis. For this reason, the weekday usage for annual members did not go up significantly compared to the previous year, while the usage over the weekend continued to increase. We also see in 2020 that the weekend usage went up significantly for casual members. We speculate that bike share was considered by many people to be a safe recreational activity in the midst of the COVID lockdowns experienced in Toronto.
#Use melt to produce an intermediate "long-form" dataframe for plotting
tidy = pd.melt(df_usage_perd_bymem, id_vars=['year','isweek_day'], value_vars=['trip_count','casual_members','annual_members'], var_name='member_type')
tidy.replace({'trip_count':'all_members'}, inplace=True)
#Set up plot
fig,axm = plt.subplots(2,1,figsize = (25,25))
sns.set(font_scale=4)
tidy['member_week_day'] = tidy['member_type'] +', '+tidy['isweek_day']
#Plot the graph for casual members, bar plots showing average daily ride counts for weekday and weekend
ax = sns.barplot(ax=axm[0], x=tidy[tidy['member_type']=="casual_members"].year,
y=tidy[tidy['member_type']=="casual_members"].value,
hue=tidy[tidy['member_type']=="casual_members"].member_week_day,
hue_order=['casual_members, weekday','casual_members, weekend'])
ax.axes.set_title("Average Number of Daily Rides for Weekday or Weekend for Casual Members")
ax.set_xlabel("Year")
ax.set_ylabel("Average Daily Ride Counts")
ax.legend()
#plot the graph for annual members only
ax = sns.barplot(ax=axm[1], x=tidy[tidy['member_type']=="annual_members"].year,
y=tidy[tidy['member_type']=="annual_members"].value,
hue=tidy[tidy['member_type']=="annual_members"].member_week_day,
hue_order=['annual_members, weekday','annual_members, weekend'])
ax.axes.set_title("Average Number of Daily Rides for Weekday or Weekend for Annual Members")
ax.set_xlabel("Year")
ax.set_ylabel("Average Daily Ride Counts")
ax.legend()
#Format figure title
fig.suptitle("Figure 9 - Average Daily Ride Counts for Weekday or Weekend By Membership Type and Year")
plt.show()
#delete dummy variable
del tidy
The graph below titled Figure 10 - Hourly Distribution of Ride Counts for Casual Members and Annual Members depicts the hourly ride counts by membership type. For annual members, there are distinct peaks during rush hour, indicating high demand for bikes during typical commuting times (8-11 AM and 5-8 PM). For casual riders, demand tends to start increasing in the afternoon and peaks in the evening between 6 and 8 PM.
#Set up plot
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
ax2.set_xlim(0,24)
#map data
sns.distplot(df_trips_data[df_trips_data['user_type'] == 'casual member']['hour'], label = "Casual",ax = ax2, hist = False)
sns.distplot(df_trips_data[df_trips_data['user_type'] == 'annual member']['hour'], label = "Annual",ax = ax2, hist = False)
#Format titles
ax2.set_title("Figure 10 - Hourly Distribution of Ride Counts for Casual Members and Annual Members")
ax2.set_xlabel("Hourly Ride Counts")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend(title='Membership Type')
plt.show()
The figure below ( Figure 11 - Trip Duration Distribution for Casual Members and Annual Members ) shows the distribution of trip duration for casual and annual members: casual members tend to take longer trips than annual members. This is likely due to the different intents of the two rider types. Annual members use the bike share to get from Point A to Point B, and since they have unlimited 30-minute rides, they have no reason to prolong a trip. Casual members, on the other hand, tend to use the bike share for recreation and pay per 30-minute ride, so they may be motivated to prolong a trip for as long as they can without incurring additional fees.
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
ax2.set_xlim(0,36)
#map data
sns.distplot(df_trips_data[df_trips_data['user_type'] == 'casual member']['trip_duration']/60, label = "Casual",ax = ax2, hist = False)
sns.distplot(df_trips_data[df_trips_data['user_type'] == 'annual member']['trip_duration']/60, label = "Annual",ax = ax2, hist = False)
#format titles
ax2.set_title("Figure 11 - Trip Duration Distribution for Casual Members and Annual Members",fontsize = 18)
ax2.set_xlabel("Trip Duration (min)", fontsize = 16)
ax2.set_ylabel("Probability Density", fontsize = 16)
ax2.xaxis.set_major_locator(ticker.MultipleLocator(2))
ax2.legend(title='Membership Type')
plt.show()
In the first part of the question, we initially looked at a line plot ( Figure 12 - Daily Ride Counts Between 2017 and 2020 ) showing the daily trip count between 2017 and 2020 for all member types. We can see an increasing trend over time as the peak gets bigger each year, but the trend is slightly confounded by day-to-day fluctuations in the daily trip count. It is interesting to note that the fluctuations in 2020 were more significant than in previous years.
We also looked at the average daily ride count for each year to determine whether the average changed. If there is an increasing trend, there should be a corresponding increase in the average from 2017 to 2020. In Figure 13A - Average Daily Ride Count by Year , we can see an increase in the average daily ride count, which suggests an increasing trend of usage from 2017 to 2020 on a daily basis. In Figure 13B - Average Monthly Ride Count by Year , there is a large confidence interval associated with the monthly average. This is attributed to the large daily fluctuations in rider count, but also to a prolonged period at the start of the pandemic, between March and May, when ride counts were extremely low.
#Set up plot
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter, MonthLocator
from matplotlib.ticker import FuncFormatter
plt.figure(figsize=(12,5))
#lineplot for all member types, daily trip count by date
usage=sns.lineplot(x=df_usage_perd_bymem.index,y=df_usage_perd_bymem['trip_count'])
usage.axes.set_title("Figure 12 - Daily Ride Counts Between 2017 and 2020",
                     fontsize=16)
usage.set_ylabel("Daily Rides",
                 fontsize=16)
usage.set_xlabel("Date",
                 fontsize=16)
#format axis
# Minor ticks every month, labelled with the month's first letter
month_fmt = DateFormatter('%b')
def m_fmt(x, pos=None):
    return month_fmt(x)[0]
usage.xaxis.set_minor_locator(MonthLocator())
usage.xaxis.set_minor_formatter(FuncFormatter(m_fmt))
# Major ticks labelled with the year
yearsFmt = mdates.DateFormatter('\n\n%Y') # add some space for the year label
usage.xaxis.set_major_formatter(yearsFmt)
plt.show()
#Set up plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Plot showing the average daily rides by year
ax = sns.barplot(x=df_usage_perd_bymem.year, y=df_usage_perd_bymem.trip_count)
ax.axes.set_title("Figure 13A - Average Daily Ride Count by Year",
fontsize=16)
ax.set_xlabel("Year", fontsize=16)
ax.set_ylabel("Average Daily Ride Counts",
fontsize=16)
ax.yaxis.set_major_locator(ticker.MultipleLocator(1000))
plt.show()
#Create intermediate dataframe
tidy=df_usage_perd_bymem.groupby(['year','month']).sum()
tidy=tidy.reset_index(level=0)
#Set up plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Plot showing the average monthly rides by year
ax = sns.barplot(x=tidy.year, y=tidy.trip_count)
ax.axes.set_title("Figure 13B - Average Monthly Ride Count by Year",
fontsize=16)
ax.set_xlabel("Year")
ax.set_ylabel("Average Monthly Ride Counts")
plt.show()
del tidy
Comparisons of average daily ride counts by month over the years ( Figure 14 - Average Daily Rides (with 95% Confidence Interval) by Month between 2017 and 2020 ) show that there is an increasing trend of usage from 2017 to 2020, particularly between May and October. The trend is not as obvious during the colder months between November and April.
Keep in mind that we did not have any bike trip data for November and December 2020. For this reason, the comparison is not available for November and December of 2020.
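As a quick sanity check on that coverage gap, one can list which months actually have data in each year before averaging across months. The sketch below is hypothetical and uses a small synthetic daily index in place of df_usage_perd_bymem, which in our data ends in October 2020.

```python
import pandas as pd

# Synthetic stand-in for df_usage_perd_bymem's daily DatetimeIndex,
# which stops at the end of October 2020
idx = pd.date_range("2019-01-01", "2020-10-31", freq="D")
df = pd.DataFrame({"trip_count": 1}, index=idx)

# For each year, collect the set of months that actually have data
coverage = df.groupby(df.index.year).apply(lambda g: set(g.index.month))

# Months absent from 2020 -- any cross-year monthly comparison should
# exclude (or at least flag) these
missing_2020 = set(range(1, 13)) - coverage.loc[2020]
print(missing_2020)
```

Running the same check against the real dataframe would flag November and December 2020 as missing before the monthly comparisons in the figures that follow.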
plt.figure(figsize=(10,5))
ax=sns.lineplot(data=df_usage_perd_bymem, x="month", y="trip_count", hue="year", err_style = 'band', ci=95)
ax.axes.set_title("Figure 14 - Average Daily Rides (with 95% Confidence Interval) by Month between 2017 and 2020",
                  fontsize=16)
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_ylabel("Average Daily Rides",
fontsize=16)
ax.set_xlabel("Month",
fontsize=16)
ax.legend(title="Year",
fontsize=14)
plt.show()
We also looked at a line plot ( Figure 15 - Daily Ride Counts for Annual and Casual Members Between 2017 and 2020 ) showing the daily trip count between 2017 and 2020 for annual and casual members. For both member types, we can see an increasing trend over time as the peak gets bigger each year, but the trends are again slightly confounded by day-to-day fluctuations in the daily trip count.
#Set up plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#lineplot for casual and annual membership, daily trip count by date
usage=sns.lineplot(x=df_usage_perd_bymem.index,y=df_usage_perd_bymem['annual_members'], label="annual members")
sns.lineplot(x=df_usage_perd_bymem.index,y=df_usage_perd_bymem['casual_members'], label="casual members")
usage.axes.set_title("Figure 15 - Daily Ride Counts for Annual and Casual Members Between 2017 and 2020")
usage.set_ylabel("Daily Rides")
usage.set_xlabel("Date")
usage.legend()
#format axis
# Minor ticks every month, labelled with the month's first letter
month_fmt = DateFormatter('%b')
def m_fmt(x, pos=None):
    return month_fmt(x)[0]
usage.xaxis.set_minor_locator(MonthLocator())
usage.xaxis.set_minor_formatter(FuncFormatter(m_fmt))
# Major ticks labelled with the year
yearsFmt = mdates.DateFormatter('\n\n%Y') # add some space for the year label
usage.xaxis.set_major_formatter(yearsFmt)
plt.show()
Comparisons of average daily ride counts by month over the years are shown for each member type in Figure 16 and Figure 17.
Based on these graphs, we can see an increase in usage for both casual and annual members from 2017 to 2020. When we break down the daily ride count by membership type, however, the biggest jumps in usage were between 2018-2019 and 2019-2020 for casual members. For annual members, the largest jump was between 2017-2018, with a smaller increase between 2018-2019. As such, there are some differences in the trends for annual and casual riders.
We believe that we did not see a large increase in annual riders between 2019 and 2020, as we did for casual riders, because of the impact of COVID-19. The uncertainty around COVID-19 in 2020 and the pressure to work from home reduced the number of people who needed to commute to the office. Because the need for commuting was reduced, 2020 was not a big year for annual riders. The impact of COVID-19 will be examined in more detail in Question 5.
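One way to make those year-over-year jumps concrete is to compute the percent change in total rides per member type. The totals below are illustrative placeholders, not values from our dataset; in the notebook the real totals would come from summing df_usage_perd_bymem by year.

```python
import pandas as pd

# Illustrative annual ride totals (NOT actual Bike Share Toronto numbers)
totals = pd.DataFrame(
    {"annual_members": [600_000, 800_000, 880_000, 850_000],
     "casual_members": [150_000, 180_000, 260_000, 400_000]},
    index=[2017, 2018, 2019, 2020],
)

# Percent change versus the previous year; the first row is NaN by definition
yoy = totals.pct_change() * 100
print(yoy.round(1))
```

A table like this makes it easy to see at a glance which transition (2017-2018, 2018-2019, or 2019-2020) drove growth for each membership type.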
# Same plot but only for casual members
plt.figure(figsize=(10,5))
ax2=sns.lineplot(data=df_usage_perd_bymem, x="month", y="casual_members", hue="year", err_style = 'band', ci=95)
ax2.axes.set_title("Figure 16 - Average Daily Rides (with 95% Confidence Interval) by Month between 2017 and 2020 for Casual Members",
                   fontsize=16)
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.set_ylabel("Average Daily Rides",
fontsize=16)
ax2.set_xlabel("Month",
fontsize=16)
ax2.legend(title="Year",
fontsize=14)
plt.show()
# Same plot but only for annual members
plt.figure(figsize=(10,5))
ax3=sns.lineplot(data=df_usage_perd_bymem, x="month", y="annual_members", hue="year", err_style = 'band', ci=95)
ax3.axes.set_title("Figure 17 - Average Daily Rides (with 95% Confidence Interval) by Month between 2017 and 2020 for Annual Members",
                   fontsize=16)
ax3.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax3.set_ylabel("Average Daily Rides",
fontsize=16)
ax3.set_xlabel("Month",
fontsize=16)
ax3.legend(title="Year",
fontsize=14)
plt.show()
The popularity of FREE RIDE WEDNESDAYS can be assessed by looking at the change in ride count over the course of the week. If the promotion is popular, we should see more rides being taken on Wednesdays.
If the FREE RIDE WEDNESDAY promotion is popular, we would anticipate that it is popular among casual members, but not necessarily annual members. Annual members already have unlimited 30-minute rides for the year, so they have no need to take advantage of FREE RIDE WEDNESDAY.
FREE RIDE WEDNESDAY is a promotion offered during one month each year, during which rides on the bike share are free on the Wednesdays of that promo month. News coverage indicates the promotion ran in September in 2020, in August in 2019, in June in 2018, and in July in 2017.
We will narrow our dataset to these specific months for which we know FREE RIDE WEDNESDAY promotion was offered.
#subset the dataframe to only include time period for which FREE RIDE WEDNESDAY promotion was offered
subset2020=df_usage_perd_bymem[(df_usage_perd_bymem.index.year==2020)&(df_usage_perd_bymem.index.month==9)]
subset2019=df_usage_perd_bymem[(df_usage_perd_bymem.index.year==2019)&(df_usage_perd_bymem.index.month==8)]
subset2018=df_usage_perd_bymem[(df_usage_perd_bymem.index.year==2018)&(df_usage_perd_bymem.index.month==6)]
subset2017=df_usage_perd_bymem[(df_usage_perd_bymem.index.year==2017)&(df_usage_perd_bymem.index.month==7)]
#Combine the dataset
freeride=pd.concat([subset2020,subset2019,subset2018,subset2017])
freeride.head()
| trip_count | annual_members | casual_members | year | dayofyear | day | month | dayofweek | isweek_day | weekday | |
|---|---|---|---|---|---|---|---|---|---|---|
| start_time | ||||||||||
| 2020-09-01 00:00:00-04:00 | 13924 | 10017 | 3907 | 2020 | 245 | 1 | 9 | 1 | weekday | Tuesday |
| 2020-09-02 00:00:00-04:00 | 14803 | 9536 | 5267 | 2020 | 246 | 2 | 9 | 2 | weekday | Wednesday |
| 2020-09-03 00:00:00-04:00 | 14673 | 10331 | 4342 | 2020 | 247 | 3 | 9 | 3 | weekday | Thursday |
| 2020-09-04 00:00:00-04:00 | 15575 | 10345 | 5230 | 2020 | 248 | 4 | 9 | 4 | weekday | Friday |
| 2020-09-05 00:00:00-04:00 | 18436 | 9227 | 9209 | 2020 | 249 | 5 | 9 | 5 | weekend | Saturday |
Figure 18 - Average Daily Rides (with 95% Confidence Interval) by Day of Week and Year for Months with FREE RIDE WEDNESDAY promotion , shown below, demonstrates that the FREE RIDE WEDNESDAY promotion significantly increases the number of rides taken on Wednesdays. For 2017 - 2019, the frequency of rides more than doubles compared to other weekdays. For 2020, the frequency does not quite double, but there are still significantly more rides taken on the Wednesdays when the promotion is available.
plt.figure(figsize=(12,8))
sns.set(font_scale=1.5)
ax=sns.barplot(data= freeride, x='weekday',y='casual_members',hue='year',
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title('Figure 18 - Average Daily Rides (with 95% Confidence Interval) by Day of Week and Year \n for Months with FREE RIDE WEDNESDAY promotion')
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_ylabel("Average Daily Rides")
ax.set_xlabel("Day of Week")
ax.legend(title='Year')
plt.show()
Figure 19 - Average Daily Rides (with 95% Confidence Interval) Comparing Months with and without FREE RIDE WEDNESDAY Promotion compares the average daily ride count for months with and without the FREE RIDE WEDNESDAY promotion. It is evident that the promotion significantly increases the number of rides taken on Wednesdays: when the promotion is available, the Wednesday trip count is roughly 5x greater than when it is not, and the ride frequency on Wednesdays exceeds that of the weekends.
#Subset for dates when FREE RIDE WEDNESDAY promotion was not available
subset2020=df_usage_perd_bymem[df_usage_perd_bymem.index.year==2020]
subset2020=subset2020[~(subset2020.index.month==9)]
subset2019=df_usage_perd_bymem[df_usage_perd_bymem.index.year==2019]
subset2019=subset2019[~(subset2019.index.month==8)]
subset2018=df_usage_perd_bymem[df_usage_perd_bymem.index.year==2018]
subset2018=subset2018[~(subset2018.index.month==6)]
subset2017=df_usage_perd_bymem[df_usage_perd_bymem.index.year==2017]
subset2017=subset2017[~(subset2017.index.month==7)]
#Combine all the dataset with no promo
nofreeride=pd.concat([subset2020,subset2019,subset2018,subset2017])
#Assign the dates a new column indicating 'promo' or 'no promo'
nofreeride['promo']='no promo'
freeride['promo']='promo'
#Combine all dataset with promo and no promo
all_data = pd.concat([nofreeride,freeride])
all_data.head()
| trip_count | annual_members | casual_members | year | dayofyear | day | month | dayofweek | isweek_day | weekday | promo | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| start_time | |||||||||||
| 2020-01-01 00:00:00-05:00 | 1321 | 1103 | 218 | 2020 | 1 | 1 | 1 | 2 | weekday | Wednesday | no promo |
| 2020-01-02 00:00:00-05:00 | 3715 | 3436 | 279 | 2020 | 2 | 2 | 1 | 3 | weekday | Thursday | no promo |
| 2020-01-03 00:00:00-05:00 | 4335 | 3950 | 385 | 2020 | 3 | 3 | 1 | 4 | weekday | Friday | no promo |
| 2020-01-04 00:00:00-05:00 | 2356 | 2163 | 193 | 2020 | 4 | 4 | 1 | 5 | weekend | Saturday | no promo |
| 2020-01-05 00:00:00-05:00 | 1941 | 1813 | 128 | 2020 | 5 | 5 | 1 | 6 | weekend | Sunday | no promo |
plt.figure(figsize=(15,10))
sns.set(font_scale=1.5)
ax=sns.barplot(data= all_data, x='weekday',y='casual_members',hue='promo',
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 19 - Average Daily Rides (with 95% Confidence Interval) Comparing Months \n with and without FREE RIDE WEDNESDAY Promotion")
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_ylabel("Average Daily Rides")
ax.set_xlabel("Day of Week")
ax.legend(title='FREE RIDE WED PROMOTION')
plt.show()
#the average daily rides when the FREE RIDE WEDNESDAY promotion is available
x=all_data[(all_data['weekday']=='Wednesday')&(all_data['promo']=='promo')]['casual_members'].mean()
print("The average daily ride count for Wednesdays when FREE RIDE WEDNESDAY promotion is available:", round(x))
#the average daily rides when the FREE RIDE WEDNESDAY promotion is not available
x=all_data[(all_data['weekday']=='Wednesday')&(all_data['promo']=='no promo')]['casual_members'].mean()
print("The average daily ride count for Wednesdays when FREE RIDE WEDNESDAY promotion is not available:", round(x))
The average daily ride count for Wednesdays when FREE RIDE WEDNESDAY promotion is available: 5073 The average daily ride count for Wednesdays when FREE RIDE WEDNESDAY promotion is not available: 885
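The two averages printed above can be reduced to a single multiplier, which is where the roughly-5x promo effect comes from. A minimal sketch, reusing the notebook's own printed figures:

```python
# Averages taken directly from the printout above
promo_avg, no_promo_avg = 5073, 885

# Express the promo effect as one ratio instead of two raw means
multiplier = promo_avg / no_promo_avg
print(f"Wednesday casual rides are {multiplier:.1f}x higher in promo months")
```

Expressing the effect as a ratio makes the Figure 19 comparison easier to quote: a single number rather than two bar heights.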
As part of this question, we need to examine how bike usage changed in 2020 compared to previous years. We will answer the following questions to determine how the COVID-19 pandemic and government-mandated lockdowns impacted bike share usage.
Was the increasing trend in bike usage seen over the past years also present in 2020?
How did the hourly distribution of ride counts change in 2020 compared to previous years?
Has the pandemic impacted the bike trip duration?
Is there an increase in trips where the start and end locations are the same?
The conclusion above is further reiterated in Figure 20C, where we see a big decrease in the ratio of rides taken by annual members to those taken by casual members in 2020. However, this trend was also somewhat observed in 2019. It would be interesting to see whether total annual membership numbers have been increasing or decreasing over the years, to determine if declining membership is the cause of this trend.
#Create intermediate dataframe created using melt to produce a "long-form" dataframe.
tidy = pd.melt(df_usage_perd_bymem, id_vars=['year'], value_vars=['casual_members','annual_members'], var_name='member_type')
#Setup plot
plt.figure(figsize=(20,10))
sns.set(font_scale=2)
#plot barplot showing average ride count by year and membership type
ax = sns.barplot(x=tidy.year,
y=tidy.value,
hue=tidy.member_type)
ax.axes.set_title("Figure 20A - Average Daily Ride Count by Year and Membership Type")
ax.set_xlabel("Year")
ax.set_ylabel("Average Daily Rides")
ax.legend(title="Membership Type")
plt.setp(ax.get_legend().get_texts(), fontsize='16') # for legend text
plt.setp(ax.get_legend().get_title(), fontsize='20') # for legend title
#Display plot
plt.show()
#Setup plot
plt.figure(figsize=(20,10))
sns.set(font_scale=2)
#plot barplot showing total ride count by year and membership type
ax = sns.barplot(x=tidy.year,
y=tidy.value,
hue=tidy.member_type, estimator=sum)
ax.axes.set_title("Figure 20B - Total Ride Count by Year and Membership Type")
ax.set_xlabel("Year")
ax.set_ylabel("Total Rides")
ax.legend(title="Membership Type")
plt.setp(ax.get_legend().get_texts(), fontsize='16') # for legend text
plt.setp(ax.get_legend().get_title(), fontsize='20') # for legend title
#Display plot
plt.show()
del tidy
#Create intermediate dataframe created using melt to produce a "long-form" dataframe.
tidy=df_usage_perd_bymem.groupby('year').sum()
#Setup plot
plt.figure(figsize=(20,10))
sns.set(font_scale=2)
#plot barplot showing ratio of annual to casual rides by year
ax = sns.barplot(x=tidy.index,
y=tidy.annual_members/tidy.casual_members)
ax.axes.set_title("Figure 20C - Ratio of Annual to Casual Rides by Year")
ax.set_xlabel("Year")
ax.set_ylabel("Ratio of Annual to Casual Rides")
#Display plot
plt.show()
Impact of COVID-19 on Hourly Usage:
fig,axm = plt.subplots(2,2,figsize = (25,15))
sns.set(font_scale=2)
user_type_str = 'annual'
lst_years = [2017,2018,2019,2020]
for indx in range(len(lst_years)):
    plt.subplot(2,2,indx+1)
    plt.xlim(0,24)
    plt.ylim(0,0.13)
    #map data
    ax = sns.distplot(df_trips_data[(df_trips_data['user_type'] == user_type_str +' member')
                                    & (df_trips_data['start_time'].dt.year == lst_years[indx])
                                    & (df_trips_data['start_time'].dt.weekday < 5)]['hour'],
                      label = "Weekday (" +str(lst_years[indx])+")", hist = False)
    #weekday >= 5 captures Saturday and Sunday (Monday is 0)
    ax = sns.distplot(df_trips_data[(df_trips_data['user_type'] == user_type_str +' member')
                                    & (df_trips_data['start_time'].dt.year == lst_years[indx])
                                    & (df_trips_data['start_time'].dt.weekday >= 5)]['hour'],
                      label = "Weekend (" +str(lst_years[indx])+")", hist = False)
    ax.set_title(str(lst_years[indx]),fontsize = 18)
    ax.set_xlabel("Hour of Day", fontsize = 16)
    ax.set_ylabel("Probability Density", fontsize = 16)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.legend(title='Annual Members')
#format titles
fig.suptitle("Figure 21 - Change in Hourly Distribution of Ride Counts for Annual Members by Year",fontsize = 18)
plt.show()
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
#map data
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2017)]['hour'], label = "Annual (2017)",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2018)]['hour'], label = "Annual (2018)",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2019)]['hour'], label = "Annual (2019)",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2020)]['hour'], label = "Annual (2020)",ax = ax2, hist = False)
#format titles
ax2.set_xlim(0,24)
ax2.set_title("Figure 21A - Hourly Distribution of Ride Counts for Annual Members by Year")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend(title='Membership Type')
plt.show()
fig,axm = plt.subplots(2,2,figsize = (25,15))
sns.set(font_scale=2)
user_type_str = 'casual'
lst_years = [2017,2018,2019,2020]
for indx in range(len(lst_years)):
    plt.subplot(2,2,indx+1)
    plt.xlim(0,24)
    plt.ylim(0,0.14)
    #map data
    ax = sns.distplot(df_trips_data[(df_trips_data['user_type'] == user_type_str +' member')
                                    & (df_trips_data['start_time'].dt.year == lst_years[indx])
                                    & (df_trips_data['start_time'].dt.weekday < 5)]['hour'],
                      label = "Weekday (" +str(lst_years[indx])+")", hist = False)
    #weekday >= 5 captures Saturday and Sunday (Monday is 0)
    ax = sns.distplot(df_trips_data[(df_trips_data['user_type'] == user_type_str +' member')
                                    & (df_trips_data['start_time'].dt.year == lst_years[indx])
                                    & (df_trips_data['start_time'].dt.weekday >= 5)]['hour'],
                      label = "Weekend (" +str(lst_years[indx])+")", hist = False)
    ax.set_title(str(lst_years[indx]),fontsize = 18)
    ax.set_xlabel("Hour of Day", fontsize = 16)
    ax.set_ylabel("Probability Density", fontsize = 16)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.legend(title='Casual Members')
#format titles
fig.suptitle("Figure 22 - Hourly Distribution of Ride Counts for Casual Members by Year")
fig.tight_layout()
plt.show()
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
ax2.set_xlim(0,24)
#map data
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2017)]['hour'], label = "Casual (2017)",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2018)]['hour'], label = "Casual (2018)",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2019)]['hour'], label = "Casual (2019)",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2020)]['hour'], label = "Casual (2020)",ax = ax2, hist = False)
#format titles
ax2.set_title("Figure 22A - Hourly Distribution of Ride Counts for Casual Members by Year")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend(title='Membership Type')
plt.show()
Impact of COVID-19 on Trip Duration:
First, to get a sense of how trip duration changed with the pandemic, we derived statistics for trip duration using the `.groupby().agg()` functions. We determined the mean, mode, and median trip duration by year and membership type in the variable trip_dur.
#Find statistics on trip duration by year and user type
trip_dur=df_trips_data.groupby(['year','user_type']).agg(Mean_Trip_Dur=('trip_duration',"mean"),
                                                         Mode_Trip_Dur=('trip_duration',lambda x: x.mode().iloc[0]),
                                                         Median_Trip_Dur=('trip_duration',"median"))
#convert all trip duration into minutes
trip_dur['Mean_Trip_Dur']=trip_dur['Mean_Trip_Dur']/60
trip_dur['Mode_Trip_Dur']=trip_dur['Mode_Trip_Dur']/60
trip_dur['Median_Trip_Dur']=trip_dur['Median_Trip_Dur']/60
#Display DataFrame
trip_dur
| Mean_Trip_Dur | Mode_Trip_Dur | Median_Trip_Dur | ||
|---|---|---|---|---|
| year | user_type | |||
| 2017 | annual member | 10.980670 | 7.350000 | 9.750000 |
| casual member | 17.401354 | 13.100000 | 16.850000 | |
| 2018 | annual member | 11.291797 | 7.050000 | 9.916667 |
| casual member | 17.917738 | 14.550000 | 17.566667 | |
| 2019 | annual member | 11.311180 | 7.133333 | 9.950000 |
| casual member | 17.831801 | 15.266667 | 17.566667 | |
| 2020 | annual member | 12.595727 | 6.116667 | 11.266667 |
| casual member | 18.300827 | 17.566667 | 18.383333 |
The following plot ( Figure 22 ) shows the median trip duration in minutes by year and membership type.
#Setup plot
plt.figure(figsize=(20,10))
sns.set(font_scale=2)
#plot barplot showing median trip duration by year and membership type
ax = sns.barplot(x=df_trips_data.year,
y=df_trips_data.trip_duration/60,
hue=df_trips_data.user_type, ci=None, estimator=np.median)
plt.ylim(0,20)
ax.yaxis.set_major_locator(ticker.MultipleLocator(2))
ax.axes.set_title("Figure 22 - Median Trip Duration By Year and Membership Type")
ax.set_xlabel("Year")
ax.set_ylabel("Median Trip Duration (minutes)")
ax.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='16') # for legend text
plt.setp(ax.get_legend().get_title(), fontsize='20') # for legend title
#Display plot
plt.show()
fig,axm = plt.subplots(2,2,figsize = (25,15))
lst_years = [2017,2018,2019,2020]
for indx in range(len(lst_years)):
    plt.subplot(2,2,indx+1)
    sns.set(font_scale=1.5)
    plt.ylim(0,0.08)
    ax = sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')
                                    & (df_trips_data['start_time'].dt.year == lst_years[indx])]['trip_duration']/60,
                      label = "casual member", hist = False)
    ax = sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')
                                    & (df_trips_data['start_time'].dt.year == lst_years[indx])]['trip_duration']/60,
                      label = "annual member", hist = False)
    ax.set_xlim(0,36)
    ax.set_title(str(lst_years[indx]))
    ax.set_xlabel("Trip Duration (min)")
    ax.set_ylabel("Probability Density")
    ax.xaxis.set_major_locator(ticker.MultipleLocator(2))
    ax.legend(title='Membership Type')
#format titles
fig.suptitle("Figure 23 - Distribution of Trip Duration by Member Type and Year")
plt.show()
fig1 = plt.figure(figsize=(10,8))
ax2 = fig1.subplots()
sns.set(font_scale=1.5)
ax2.set_xlim(0,36)
#map data
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2017)]['trip_duration']/60, label = "2017",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2018)]['trip_duration']/60, label = "2018",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2019)]['trip_duration']/60, label = "2019",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'annual member')& (df_trips_data['start_time'].dt.year == 2020)]['trip_duration']/60, label = "2020",ax = ax2, hist = False)
#format titles
ax2.set_title("Figure 23A - Trip Duration Distribution for Annual Members by Year")
ax2.set_xlabel("Trip Duration (min)")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(2))
ax2.legend(title='Year')
plt.show()
fig1 = plt.figure(figsize=(10,8))
ax2 = fig1.subplots()
sns.set(font_scale=1.5)
ax2.set_xlim(0,36)
#map data
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2017)]['trip_duration']/60, label = "2017",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2018)]['trip_duration']/60, label = "2018",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2019)]['trip_duration']/60, label = "2019",ax = ax2, hist = False)
sns.distplot(df_trips_data[(df_trips_data['user_type'] == 'casual member')& (df_trips_data['start_time'].dt.year == 2020)]['trip_duration']/60, label = "2020",ax = ax2, hist = False)
#format titles
ax2.set_title("Figure 23B - Trip Duration Distribution for Casual Members by Year")
ax2.set_xlabel("Trip Duration (min)")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(2))
ax2.legend(title='Year')
plt.show()
We speculated that during the pandemic, recreational use of the bike share increased relative to its use for commuting (that is, for getting from Point A to Point B). To examine this, we looked at the number of trips by year where the start and end stations were the same. If users ride solely for exercise, we would anticipate an increase in the number of trips where the start and end station is the same.
As seen in Figure 24 - Number of Rides With Same Start and End Stations By Year , there is a significant increase in 2020 in the number of trips with the same start and end stations. This strongly indicates that more people, both casual and annual members, began using the bike share program for exercise rather than to get from Point A to Point B.
#Create variable containing trips where start and end station id are the same
tidy=df_trips_data.loc[df_trips_data['start_station_id']==df_trips_data['end_station_id']]
#Setup plot
plt.figure(figsize=(20,10))
sns.set(font_scale=2)
#plot countplot showing rides with the same start and end station by year and membership type
ax = sns.countplot(data=tidy, x='year', hue='user_type')
ax.axes.set_title("Figure 24 - Number of Rides With Same Start and End Stations By Year")
ax.set_xlabel("Year")
ax.set_ylabel("Ride Count")
ax.legend()
plt.setp(ax.get_legend().get_texts(), fontsize='16') # for legend text
plt.setp(ax.get_legend().get_title(), fontsize='20') # for legend title
#Display plot
plt.show()
To determine how statutory holidays impact demand, we first generated a list of Canadian holidays. Then, we subset df_trips_data to the trips that occurred on statutory holidays and used that data to determine the hourly usage.
Figure 24 shows the hourly distribution of ride counts comparing usage on statutory holidays and weekdays. Figure 25 shows the hourly distribution of ride counts comparing usage on statutory holidays and weekends. Figure 26 shows the hourly distribution of ride counts comparing usage on statutory holidays and weekends by membership type.
From the analysis, we determined that on statutory holidays the hourly usage is similar to that of weekends, for both annual and casual members.
#Generate list of Canadian Holidays (requires the third-party 'holidays' package)
import holidays
canadian_holidays = holidays.Canada(years = [2017, 2018, 2019, 2020])
#Add an is_holiday feature (boolean) to df_trips_data based on the presence of the date in the list
df_trips_data['is_holiday'] = [x in canadian_holidays for x in df_trips_data['start_time']]
#Add an isweek_day feature (boolean) to df_trips_data based on whether the date is a weekday
df_trips_data['isweek_day'] = df_trips_data.start_time.dt.weekday <5
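A note on the is_holiday flag above: the list comprehension tests every timestamp against the holiday calendar one row at a time, which can be slow on millions of trips. A vectorised date comparison is usually faster. The sketch below is a hypothetical alternative, with a hand-built holiday set standing in for the output of holidays.Canada and a tiny synthetic frame standing in for df_trips_data.

```python
import pandas as pd

# Hypothetical stand-ins: a few trip timestamps and a holiday date set
# (in the notebook this set would come from holidays.Canada(...))
trips = pd.DataFrame({"start_time": pd.to_datetime(
    ["2019-07-01 08:00", "2019-07-02 09:00", "2019-12-25 10:00"])})
holiday_dates = {pd.Timestamp("2019-07-01").date(),
                 pd.Timestamp("2019-12-25").date()}

# Vectorised flag: compare calendar dates, not full timestamps
trips["is_holiday"] = trips["start_time"].dt.date.isin(holiday_dates)
print(trips["is_holiday"].tolist())  # [True, False, True]
```

Both approaches produce the same boolean column; the `.dt.date.isin(...)` form simply avoids a Python-level loop over every row.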
#Setup plot
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
ax2.set_xlim(0,24)
#Plot the hourly distribution of bike counts for holidays and non-holidays (that are also not weekends)
sns.distplot(df_trips_data[df_trips_data['is_holiday']]['hour'], label = "Holiday", hist = False)
sns.distplot(df_trips_data[~(df_trips_data['is_holiday'])&(df_trips_data['isweek_day'])]['hour'], label = "Weekday", hist = False)
#format plot
ax2.set_title("Figure 24 - Hourly Distribution of Ride Counts Comparing Statutory Holidays and Weekdays")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend()
plt.show()
#Setup plot
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
ax2.set_xlim(0,24)
#Plot the hourly distribution of bike counts for holidays and weekends
sns.distplot(df_trips_data[df_trips_data['is_holiday']]['hour'], label = "Holiday",ax = ax2, hist = False)
sns.distplot(df_trips_data[~(df_trips_data['isweek_day'])]['hour'], label = "Weekend",ax = ax2, hist = False)
#format plot
ax2.set_title("Figure 25 - Hourly Distribution of Ride Counts Comparing Statutory Holidays and Weekends")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend()
plt.show()
#Setup plot
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.2)
ax2.set_xlim(0,24)
#Hourly distribution comparison of statutory holidays and weekends by member type
sns.distplot(df_trips_data[df_trips_data['is_holiday'] & (df_trips_data['user_type'] == 'annual member')]['hour'], label = "Annual Member, Stat",ax = ax2, hist = False)
sns.distplot(df_trips_data[~(df_trips_data['isweek_day']) & (df_trips_data['user_type'] == 'annual member')]['hour'], label = "Annual Member, Weekend",ax = ax2, hist = False)
sns.distplot(df_trips_data[df_trips_data['is_holiday']& (df_trips_data['user_type'] == 'casual member')]['hour'], label = "Casual Member, Stat",ax = ax2, hist = False)
sns.distplot(df_trips_data[~(df_trips_data['isweek_day']) & (df_trips_data['user_type'] == 'casual member')]['hour'], label = "Casual Member, Weekend",ax = ax2, hist = False)
#format plot
ax2.set_title("Figure 26 - Hourly Distribution of Ride Counts on Statutory Holidays and Weekends by Membership Type")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend()
plt.show()
Questions 7 and 8 will be answered together below. As part of these questions, we conducted the following analysis:
# Import the map of the Toronto neighborhood
neighbourhoods = gpd.read_file('toronto_neighbourhoods.shp')
# View GeoDataFrame
neighbourhoods.head()
| FIELD_1 | FIELD_2 | FIELD_3 | FIELD_4 | FIELD_5 | FIELD_6 | FIELD_7 | FIELD_8 | FIELD_9 | FIELD_10 | FIELD_11 | FIELD_12 | FIELD_13 | FIELD_14 | FIELD_15 | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2101 | 25886861 | 25926662 | 49885 | 94 | 94 | Wychwood (94) | Wychwood (94) | None | None | -79.425515 | 43.676919 | 16491505 | 3.217960e+06 | 7515.779658 | POLYGON ((-79.43592 43.68015, -79.43492 43.680... |
| 1 | 2102 | 25886820 | 25926663 | 49885 | 100 | 100 | Yonge-Eglinton (100) | Yonge-Eglinton (100) | None | None | -79.403590 | 43.704689 | 16491521 | 3.160334e+06 | 7872.021074 | POLYGON ((-79.41096 43.70408, -79.40962 43.704... |
| 2 | 2103 | 25886834 | 25926664 | 49885 | 97 | 97 | Yonge-St.Clair (97) | Yonge-St.Clair (97) | None | None | -79.397871 | 43.687859 | 16491537 | 2.222464e+06 | 8130.411276 | POLYGON ((-79.39119 43.68108, -79.39141 43.680... |
| 3 | 2104 | 25886593 | 25926665 | 49885 | 27 | 27 | York University Heights (27) | York University Heights (27) | None | None | -79.488883 | 43.765736 | 16491553 | 2.541821e+07 | 25632.335242 | POLYGON ((-79.50529 43.75987, -79.50488 43.759... |
| 4 | 2105 | 25886688 | 25926666 | 49885 | 31 | 31 | Yorkdale-Glen Park (31) | Yorkdale-Glen Park (31) | None | None | -79.457108 | 43.714672 | 16491569 | 1.156669e+07 | 13953.408098 | POLYGON ((-79.43969 43.70561, -79.44011 43.705... |
#Strip the trailing "(id)" from the neighbourhood name, then keep only the geometry and name columns
neighbourhoods['FIELD_8'] = neighbourhoods['FIELD_8'].str.split('(').str[0].str.strip()
neighbourhoods = neighbourhoods[['geometry', 'FIELD_8']]
neighbourhoods.rename(columns ={'FIELD_8':'neighbourhood'}, inplace = True)
print(neighbourhoods.crs)
neighbourhoods.head()
epsg:4326
| geometry | neighbourhood | |
|---|---|---|
| 0 | POLYGON ((-79.43592 43.68015, -79.43492 43.680... | Wychwood |
| 1 | POLYGON ((-79.41096 43.70408, -79.40962 43.704... | Yonge-Eglinton |
| 2 | POLYGON ((-79.39119 43.68108, -79.39141 43.680... | Yonge-St.Clair |
| 3 | POLYGON ((-79.50529 43.75987, -79.50488 43.759... | York University Heights |
| 4 | POLYGON ((-79.43969 43.70561, -79.44011 43.705... | Yorkdale-Glen Park |
The bikeshare_stations GeoDataFrame does not contain CRS information because we constructed it ourselves from (lat, lon) coordinates. However, we know from publicbikesystem.net that the station locations use the same CRS as neighbourhoods.
So, the CRS of the bike stations GeoDataFrame df_stations was set to EPSG:4326.
Initially, the neighbourhood analysis was conducted by year as we were not sure if the most popular start and end location of bike trips changes from year to year.
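Before bringing in geometry, the per-year counting step can be sketched on a toy trip log (station ids and dates are made up for illustration):

```python
import pandas as pd

# Toy trip log: start timestamp and start station id (values are made up)
trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-05-01", "2019-05-02", "2019-06-01", "2020-03-01"]),
    "start_station_id": [7000, 7000, 7001, 7001],
})

# Count trips per start station for one year, as riders_per_neighbourhood_year does
year = 2019
start_counts = trips[trips["start_time"].dt.year == year]["start_station_id"].value_counts()
print(start_counts.to_dict())  # station 7000 has 2 starts in 2019, 7001 has 1
```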
#Create a function to match neighbourhoods to stations
def find_neighbourhood(row):
    for x in range(neighbourhoods.shape[0]):
        if row.within(neighbourhoods.loc[x,'geometry']):
            return neighbourhoods.loc[x,'neighbourhood']
#Create a function to find the ride count starting from and ending at each neighbourhood for a given year
def riders_per_neighbourhood_year(stations_info, df_trips, year, df_neighbourhoods):
    df = stations_info.copy()
    df_output = df_neighbourhoods.copy()
    #Count trips starting and ending at each station (by station id) for the given year
    start = df_trips[df_trips['start_time'].dt.year == year]['start_station_id'].value_counts()
    end = df_trips[df_trips['start_time'].dt.year == year]['end_station_id'].value_counts()
    #Merge start trip value counts and rename column
    df = pd.merge(df, start, left_on = 'station_id', right_index = True, how = 'inner')
    df.rename(columns = {'start_station_id':'start_ride_count'}, inplace = True)
    #Merge end trip value counts and rename column
    df = pd.merge(df, end, left_on = 'station_id', right_index = True, how = 'inner')
    df.rename(columns = {'end_station_id':'end_ride_count'}, inplace = True)
    #Assign station counts to neighbourhoods
    df_output['rides_started'] = df_output.apply(lambda row: df['start_ride_count'][df.within(row.geometry)].sum(), axis = 1)
    df_output['rides_ended'] = df_output.apply(lambda row: df['end_ride_count'][df.within(row.geometry)].sum(), axis = 1)
    #Focus on neighbourhoods with at least one ride starting or ending within their boundaries
    return df_output[(df_output['rides_started']>0) | (df_output['rides_ended']>0)]
#Create a stations GeoDataFrame with station name, id and point geometry
df_stations = pd.read_csv('bikeshare_stations.csv')
df_stations.columns = [s.replace(' ','_').lower() for s in df_stations.columns ]
df_stations = gpd.GeoDataFrame(df_stations,geometry=gpd.points_from_xy(df_stations['lon'],df_stations['lat']))
df_stations = df_stations.set_crs(epsg=4326)  #assign WGS84 lat/lon CRS (the dict-style {'init': ...} syntax is deprecated)
#Use the functions previously defined to determine the ride count originating out of each neighbourhood by year
df_neighbourhood_traffic_2017 = riders_per_neighbourhood_year(df_stations,df_trips_data,2017,neighbourhoods)
df_neighbourhood_traffic_2018 = riders_per_neighbourhood_year(df_stations,df_trips_data,2018,neighbourhoods)
df_neighbourhood_traffic_2019 = riders_per_neighbourhood_year(df_stations,df_trips_data,2019,neighbourhoods)
df_neighbourhood_traffic_2020 = riders_per_neighbourhood_year(df_stations,df_trips_data,2020,neighbourhoods)
df_neighbourhood_traffic_2020.head()
| geometry | neighbourhood | rides_started | rides_ended | |
|---|---|---|---|---|
| 0 | POLYGON ((-79.43592 43.68015, -79.43492 43.680... | Wychwood | 16892 | 13285 |
| 1 | POLYGON ((-79.41096 43.70408, -79.40962 43.704... | Yonge-Eglinton | 11800 | 9455 |
| 2 | POLYGON ((-79.39119 43.68108, -79.39141 43.680... | Yonge-St.Clair | 3322 | 2652 |
| 3 | POLYGON ((-79.50529 43.75987, -79.50488 43.759... | York University Heights | 5295 | 5258 |
| 5 | POLYGON ((-79.50552 43.66281, -79.50577 43.662... | Lambton Baby Point | 2779 | 2805 |
#Check the most popular starting and ending neighbourhood by year
yearly_traffic = {2017: df_neighbourhood_traffic_2017,
                  2018: df_neighbourhood_traffic_2018,
                  2019: df_neighbourhood_traffic_2019,
                  2020: df_neighbourhood_traffic_2020}
for year, df in yearly_traffic.items():
    top_start = df.sort_values('rides_started', ascending = False)['neighbourhood'].iloc[0]
    top_end = df.sort_values('rides_ended', ascending = False)['neighbourhood'].iloc[0]
    print(f'In {year} the neighbourhood with the most trips starting within the neighbourhood was: {top_start}')
    print(f'In {year} the neighbourhood with the most trips ending within the neighbourhood was: {top_end}')
    print('\n')
In 2017 the neighbourhood with the most trips starting within the neighbourhood was: Waterfront Communities-The Island
In 2017 the neighbourhood with the most trips ending within the neighbourhood was: Waterfront Communities-The Island

In 2018 the neighbourhood with the most trips starting within the neighbourhood was: Waterfront Communities-The Island
In 2018 the neighbourhood with the most trips ending within the neighbourhood was: Waterfront Communities-The Island

In 2019 the neighbourhood with the most trips starting within the neighbourhood was: Waterfront Communities-The Island
In 2019 the neighbourhood with the most trips ending within the neighbourhood was: Waterfront Communities-The Island

In 2020 the neighbourhood with the most trips starting within the neighbourhood was: Waterfront Communities-The Island
In 2020 the neighbourhood with the most trips ending within the neighbourhood was: Waterfront Communities-The Island
Based on the analysis above, the most popular neighbourhood for both the start and end of trips is Waterfront Communities-The Island, and this does not change from year to year.
As such, moving forward, the dataset was treated as a single dataset (without breaking it out by year) to determine the neighbourhoods with the largest number of rides departing from and ending within their boundaries.
In the analysis below, we check how many stations are located in each Toronto neighbourhood.
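The station counts below rely on Shapely's `within` predicate, which is a point-in-polygon test. A minimal ray-casting sketch in pure Python (not Shapely's actual implementation, which is backed by GEOS) conveys the idea:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) crosses; an odd crossing count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y, and is the crossing to the right of x?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

# Unit square as a stand-in for a neighbourhood boundary
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, square))  # True: station inside
print(point_in_polygon(1.5, 0.5, square))  # False: station outside
```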
#Create a function to count the number of stations located in each neighbourhood
def station_in_neighbourhood(row):
    """row is a row of the neighbourhoods GeoDataFrame containing a neighbourhood geometry.
    Returns the number of bike stations contained in that geometry."""
    cntr = 0
    for x in df_stations.geometry:
        if x.within(row.geometry):
            cntr += 1
    return cntr
#Apply the function to each row of the neighbourhoods GeoDataFrame to determine the number of stations per neighbourhood
neighbourhoods['stations'] = neighbourhoods.apply(lambda row: station_in_neighbourhood(row), axis=1)
#Sort values so the neighbourhood with the largest number of stations comes on top
neighbourhoods.sort_values(by = ['stations'], inplace = True, ascending = False)
#Create a new field called "station_density": stations per unit of geometry area
#(the CRS is geographic, so areas are in squared degrees; this is fine for relative comparison)
neighbourhoods['station_density'] = neighbourhoods['stations']/neighbourhoods.geometry.area
# View GeoDataFrame
neighbourhoods.head(10)
| geometry | neighbourhood | stations | station_density | |
|---|---|---|---|---|
| 61 | POLYGON ((-79.37697 43.64688, -79.37576 43.647... | Waterfront Communities-The Island | 60 | 40082.256532 |
| 80 | POLYGON ((-79.38752 43.65067, -79.38663 43.650... | Bay Street Corridor | 47 | 232730.981815 |
| 97 | POLYGON ((-79.37672 43.66242, -79.37658 43.662... | Church-Yonge Corridor | 32 | 210090.029558 |
| 26 | POLYGON ((-79.42778 43.62979, -79.42781 43.629... | Niagara | 31 | 85713.809383 |
| 136 | POLYGON ((-79.40401 43.64719, -79.40419 43.647... | Kensington-Chinatown | 26 | 151796.641042 |
| 49 | POLYGON ((-79.32868 43.64745, -79.32867 43.647... | South Riverdale | 22 | 17978.615195 |
| 77 | POLYGON ((-79.39414 43.66872, -79.39588 43.668... | Annex | 21 | 67410.048564 |
| 18 | POLYGON ((-79.35174 43.65557, -79.35208 43.655... | Moss Park | 21 | 133149.733503 |
| 59 | POLYGON ((-79.40772 43.65648, -79.40847 43.658... | University | 20 | 127498.985118 |
| 106 | POLYGON ((-79.41842 43.66358, -79.41887 43.663... | Dovercourt-Wallace Emerson-Junction | 20 | 48083.657833 |
It was determined that Waterfront Communities-The Island had the greatest number of stations located within its boundaries. Since this neighbourhood has the most stations, it could be expected that it would also have the greatest number of rides beginning and ending in the neighbourhood.
In the cell below, we determine the number of rides originating and ending at each station. Then, the values are merged with df_stations, the GeoPandas DataFrame containing the geometry (point) of each bike station in Toronto. Using this, the ride counts of all stations falling within the boundaries of a neighbourhood are summed and assigned to that neighbourhood in the df_neighbourhoods dataframe. Simultaneously, the percentage of total rides starting and ending in each neighbourhood is calculated. Finally, df_neighbourhoods is subset to only contain neighbourhoods with at least one trip originating or ending within their bounds, and saved to df_neighbourhoods_map. This filter removes all neighbourhoods without a bike station within their boundaries.
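The count-and-merge step described above can be sketched on a toy trip table (station ids are made up for illustration):

```python
import pandas as pd

# Toy stations and trips (ids are made up for illustration)
stations = pd.DataFrame({"station_id": [7000, 7001, 7002]})
trips = pd.DataFrame({"start_station_id": [7000, 7000, 7001],
                      "end_station_id":   [7001, 7002, 7002]})

# Per-station counts of trip starts and ends, indexed by station id
start = trips["start_station_id"].value_counts().to_frame("start_ride_count")
end = trips["end_station_id"].value_counts().to_frame("end_ride_count")

# Left-merge on station_id so stations with no trips keep NaN counts
stations = stations.merge(start, left_on="station_id", right_index=True, how="left")
stations = stations.merge(end, left_on="station_id", right_index=True, how="left")

# Share of all trips starting at each station
stations["start_perc"] = stations["start_ride_count"] / len(trips) * 100
print(stations)
```

Using `how="left"` mirrors the notebook's merge: stations without any recorded trips survive the join with NaN counts rather than being dropped.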
#copy neighbourhoods into new variable used for this analysis
df_neighbourhoods = neighbourhoods.copy()
# calc number of trips starting and ending at each stations based on station id
start = df_trips_data['start_station_id'].value_counts()
end = df_trips_data['end_station_id'].value_counts()
#merge start trip value counts and rename column
df_stations = pd.merge(df_stations,start, left_on = 'station_id', right_index = True, how = 'left')
df_stations.rename(columns = {'start_station_id':'start_ride_count'}, inplace = True)
#merge end trip value counts and rename column
df_stations = pd.merge(df_stations,end, left_on = 'station_id', right_index = True, how = 'left')
df_stations.rename(columns = {'end_station_id':'end_ride_count'}, inplace = True)
#Sum the ride counts of the stations falling within each neighbourhood's boundaries
df_neighbourhoods['rides_started_count'] = df_neighbourhoods.apply(lambda row: df_stations['start_ride_count'][df_stations.within(row.geometry)].sum(), axis = 1)
df_neighbourhoods['rides_ended_count'] = df_neighbourhoods.apply(lambda row: df_stations['end_ride_count'][df_stations.within(row.geometry)].sum(), axis = 1)
#Express the counts as percentages of all trips (reusing the counts avoids a second spatial pass)
df_neighbourhoods['rides_started_perc'] = df_neighbourhoods['rides_started_count']/df_trips_data.shape[0]*100
df_neighbourhoods['rides_ended_perc'] = df_neighbourhoods['rides_ended_count']/df_trips_data.shape[0]*100
#Focus on neighbourhoods with rides starting or ending within their boundaries
#df_neighbourhoods_map subsets the neighbourhoods dataframe so it only contains neighbourhoods with at least one trip
#originating or ending within their boundaries; this removes all neighbourhoods without a bike station
#(.copy() avoids SettingWithCopyWarning when columns are added to this subset later)
df_neighbourhoods_map = df_neighbourhoods[(df_neighbourhoods['rides_started_perc']>0) | (df_neighbourhoods['rides_ended_perc']>0)].copy()
df_neighbourhoods_map.head()
| geometry | neighbourhood | stations | station_density | rides_started_perc | rides_ended_perc | rides_started_count | rides_ended_count | |
|---|---|---|---|---|---|---|---|---|
| 61 | POLYGON ((-79.37697 43.64688, -79.37576 43.647... | Waterfront Communities-The Island | 60 | 40082.256532 | 20.558125 | 21.900841 | 1646176.0 | 1753693.0 |
| 80 | POLYGON ((-79.38752 43.65067, -79.38663 43.650... | Bay Street Corridor | 47 | 232730.981815 | 15.257143 | 15.655561 | 1221704.0 | 1253607.0 |
| 97 | POLYGON ((-79.37672 43.66242, -79.37658 43.662... | Church-Yonge Corridor | 32 | 210090.029558 | 8.607226 | 8.017985 | 689217.0 | 642034.0 |
| 26 | POLYGON ((-79.42778 43.62979, -79.42781 43.629... | Niagara | 31 | 85713.809383 | 8.004523 | 8.238169 | 640956.0 | 659665.0 |
| 136 | POLYGON ((-79.40401 43.64719, -79.40419 43.647... | Kensington-Chinatown | 26 | 151796.641042 | 7.740730 | 7.845495 | 619833.0 | 628222.0 |
The table below shows the top 5 neighbourhoods with the largest number of rides originating from within their boundaries.
df_neighbourhoods_map.sort_values(by = ['rides_started_perc'], ascending = False)[['neighbourhood','rides_started_perc','rides_started_count']].head()
| neighbourhood | rides_started_perc | rides_started_count | |
|---|---|---|---|
| 61 | Waterfront Communities-The Island | 20.558125 | 1646176.0 |
| 80 | Bay Street Corridor | 15.257143 | 1221704.0 |
| 97 | Church-Yonge Corridor | 8.607226 | 689217.0 |
| 26 | Niagara | 8.004523 | 640956.0 |
| 136 | Kensington-Chinatown | 7.740730 | 619833.0 |
The table below shows the top 5 neighbourhoods with the largest number of rides terminating within their boundaries.
df_neighbourhoods_map.sort_values(by = ['rides_ended_perc'], ascending = False)[['neighbourhood','rides_ended_perc','rides_ended_count']].head()
| neighbourhood | rides_ended_perc | rides_ended_count | |
|---|---|---|---|
| 61 | Waterfront Communities-The Island | 21.900841 | 1753693.0 |
| 80 | Bay Street Corridor | 15.655561 | 1253607.0 |
| 26 | Niagara | 8.238169 | 659665.0 |
| 97 | Church-Yonge Corridor | 8.017985 | 642034.0 |
| 136 | Kensington-Chinatown | 7.845495 | 628222.0 |
In the map below, we have plotted a choropleth map of Toronto highlighting the neighbourhoods with the largest number of rides departing from bike stations located within their boundaries.
# Create a base map
map_3 = folium.Map(location=[43.6559811,-79.3864663],
tiles='cartodbpositron',
zoom_start=12)
# Add station to the map
for idx, row in df_stations.to_crs(epsg=4326).iterrows():
folium.CircleMarker([row.geometry.y, row.geometry.x],
radius=1,
color='blue',fill=True,fill_color='#3186cc',
fill_opacity=0,parse_html=False).add_to(map_3)
# Add a choropleth map to the base map
Choropleth(geo_data=df_neighbourhoods_map.__geo_interface__,
columns=['neighbourhood', 'rides_started_perc'],
data=df_neighbourhoods_map,
key_on='feature.properties.neighbourhood',
fill_color='YlOrRd',
legend_name='Percentage of Rides Starting'
).add_to(map_3)
map_3
In the map below, we have plotted a choropleth map of Toronto highlighting the neighbourhoods with the largest number of rides ending at bike stations located within their boundaries.
# Create a base map
map_3 = folium.Map(location=[43.6559811,-79.3864663],
tiles='cartodbpositron',
zoom_start=12)
# Add station to the map
for idx, row in df_stations.to_crs(epsg=4326).iterrows():
folium.CircleMarker([row.geometry.y, row.geometry.x],
radius=1,
color='blue',fill=True,fill_color='#3186cc',
fill_opacity=0,parse_html=False).add_to(map_3)
# Add a choropleth map to the base map
Choropleth(geo_data=df_neighbourhoods_map.__geo_interface__,
columns=['neighbourhood', 'rides_ended_perc'],
data=df_neighbourhoods_map,
key_on='feature.properties.neighbourhood',
fill_color='YlOrRd',
legend_name='Percentage of Rides Ending'
).add_to(map_3)
map_3
It is interesting to note that if the number of rides departing from or terminating in a neighbourhood is normalized to the size of the neighbourhood, we obtain slightly different results. In the cells below, we created two columns with the number of rides normalized to the area of each neighbourhood. When we plot the result, some neighbourhoods begin to pop out that were not seen in the previous maps.
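The area normalization can be illustrated with the shoelace formula, which is effectively what `geometry.area` computes for each neighbourhood polygon (the coordinates here are toy planar values, not real neighbourhood boundaries):

```python
def shoelace_area(polygon):
    """Shoelace formula: area of a simple polygon given as (x, y) vertices."""
    n = len(polygon)
    s = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Two toy neighbourhoods: a 1x1 square and a 2x2 square
small = [(0, 0), (1, 0), (1, 1), (0, 1)]
large = [(0, 0), (2, 0), (2, 2), (0, 2)]

rides = {"small": 100, "large": 100}
density = {name: rides[name] / shoelace_area(poly)
           for name, poly in [("small", small), ("large", large)]}
print(density)  # same ride count, but the smaller neighbourhood is 4x denser
```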
#Determine the rides per unit area of each neighbourhood (area is in squared degrees; scaled down by 10,000 for readability)
df_neighbourhoods_map['start_ride_per_area']=df_neighbourhoods['rides_started_count']/df_neighbourhoods.geometry.area/10000
df_neighbourhoods_map['end_ride_per_area']=df_neighbourhoods['rides_ended_count']/df_neighbourhoods.geometry.area/10000
#Top 5 neighbourhoods based on number of start rides normalized to neighbourhood area
df_neighbourhoods_map.sort_values(by = ['start_ride_per_area'], ascending = False)[['neighbourhood','rides_started_perc','rides_started_count','start_ride_per_area']].head()
| neighbourhood | rides_started_perc | rides_started_count | start_ride_per_area | |
|---|---|---|---|---|
| 80 | Bay Street Corridor | 15.257143 | 1221704.0 | 604953.981717 |
| 97 | Church-Yonge Corridor | 8.607226 | 689217.0 | 452492.562193 |
| 28 | North St.James Town | 2.234189 | 178901.0 | 377774.912506 |
| 136 | Kensington-Chinatown | 7.740730 | 619833.0 | 361879.105411 |
| 59 | University | 5.267175 | 421765.0 | 268873.047291 |
#Top 5 neighbourhoods based on number of end rides normalized to neighbourhood area
df_neighbourhoods_map.sort_values(by = ['end_ride_per_area'], ascending = False)[['neighbourhood','rides_ended_perc','rides_ended_count','end_ride_per_area']].head()
| neighbourhood | rides_ended_perc | rides_ended_count | end_ride_per_area | |
|---|---|---|---|---|
| 80 | Bay Street Corridor | 15.655561 | 1253607.0 | 620751.463659 |
| 97 | Church-Yonge Corridor | 8.017985 | 642034.0 | 421515.443866 |
| 136 | Kensington-Chinatown | 7.845495 | 628222.0 | 366776.882418 |
| 28 | North St.James Town | 1.710113 | 136936.0 | 289159.844936 |
| 59 | University | 4.909570 | 393130.0 | 250618.380097 |
Below is a map of the top neighbourhoods with the largest number of rides originating out of the neighbourhood, normalized to neighbourhood area.
# Create a base map
map_3 = folium.Map(location=[43.6559811,-79.3864663],
tiles='cartodbpositron',
zoom_start=12)
# Add station to the map
for idx, row in df_stations.to_crs(epsg=4326).iterrows():
folium.CircleMarker([row.geometry.y, row.geometry.x],
radius=1,
color='blue',fill=True,fill_color='#3186cc',
fill_opacity=0,parse_html=False).add_to(map_3)
# Add a choropleth map to the base map
Choropleth(geo_data=df_neighbourhoods_map.__geo_interface__,
columns=['neighbourhood', 'start_ride_per_area'],
data=df_neighbourhoods_map,
key_on='feature.properties.neighbourhood',
fill_color='YlOrRd',
legend_name='Start Rides Per Neighbourhood Area (x10,000)'
).add_to(map_3)
map_3
Below is a map of the top neighbourhoods with the largest number of rides terminating in the neighbourhood, normalized to neighbourhood area.
# Create a base map
map_3 = folium.Map(location=[43.6559811,-79.3864663],
tiles='cartodbpositron',
zoom_start=12)
# Add station to the map
for idx, row in df_stations.to_crs(epsg=4326).iterrows():
folium.CircleMarker([row.geometry.y, row.geometry.x],
radius=1,
color='blue',fill=True,fill_color='#3186cc',
fill_opacity=0,parse_html=False).add_to(map_3)
# Add a choropleth map to the base map
Choropleth(geo_data=df_neighbourhoods_map.__geo_interface__,
columns=['neighbourhood', 'end_ride_per_area'],
data=df_neighbourhoods_map,
key_on='feature.properties.neighbourhood',
fill_color='YlOrRd',
legend_name='End Ride Counts Per Neighbourhood Area (x10,000)'
).add_to(map_3)
map_3
As part of questions 9 and 10, we will explore the following:
What kind of weather and temperature conditions are most of the bike trips taken in?
How do temperature, relative humidity and weather conditions impact trip duration?
How do relative humidity, temperature and weather conditions affect the hourly and daily ride count?
Based on Figure 28 - Number of Bike Trips as a Function of Temperature (2017 - 2020), bike rides tend to be most popular when the temperature is between 15 and 27 degrees Celsius. A smaller group of riders continues to use the bike share program between 0 and 15 degrees Celsius. Outside these temperature ranges, the number of users is very limited.
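Quantifying the share of trips in a temperature band of interest is a simple boolean-mask computation; the temperatures below are made up for illustration:

```python
import pandas as pd

# Toy per-trip temperatures in Celsius (made up for illustration)
temp_c = pd.Series([-5.0, 3.0, 16.0, 21.5, 26.0, 30.0, 18.0, 9.0])

# Share of trips taken in the 15-27 C comfort band (between() is inclusive on both ends)
in_band = temp_c.between(15, 27)
share = in_band.mean() * 100
print(f"{share:.1f}% of trips fall between 15 and 27 C")
```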
#Subset the data to only include trip records that contain temperature data
subset_temp=df_trips_data[df_trips_data['temp_c'].notnull()]
subset_temp.shape
(7958948, 35)
sns.set(font_scale=1.5)
#Set up plot area
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
#map data
sns.distplot(subset_temp['temp_c'],ax = ax2, kde=True, hist=False)
#format titles
ax2.set_title("Figure 28 - Number of Bike Trips as a Function of Temperature (2017 - 2020)")
ax2.set_xlabel("Temperature (Celsius)" )
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(5))
plt.show()
A similar analysis of the relative humidity, shown in Figure 29 - Number of Bike Trips as a Function of Relative Humidity (2017 - 2020), also shows that there is perhaps an optimal humidity range, between 65 and 85%, where riders prefer to go riding.
#Subset the data to only include trip records that contain relative humidity data
subset_hum=df_trips_data[df_trips_data['rel_hum_'].notnull()]
sns.set(font_scale=1.5)
#Set up plot area
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
#map data
sns.distplot(subset_hum['rel_hum_'],ax = ax2, kde=True, hist=False)
#format titles
ax2.set_title("Figure 29 - Number of Bike Trips as a Function of Relative Humidity (2017 - 2020)")
ax2.set_xlabel("Relative Humidity (%)" )
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(5))
plt.show()
The weather descriptions in the 'weather' column of df_trips_data were simplified into their dominant type, based on the first weather phenomenon in the description. For example, 'Thunderstorms,Rain,Fog' was simplified into 'Thunderstorms' and 'Snow,Blowing_Snow' into 'Snow'. This information is stored in a new column called category.
The proportion of bike rides taken in different weather conditions is shown below in Figure 30 - Percentage of Bike Trips in Database as Function of Weather Condition. The data shows that poor weather conditions strongly deter users from riding, given that over 92% of bike trips in the entire database were taken when the weather was clear.
#Create a new column called category that simplifies the weather condition
df_trips_data['category']=df_trips_data['weather'].str.split(",").str[0]
df_trips_data['category']=df_trips_data['category'].str.replace("_"," ")
df_trips_data['category']=df_trips_data['category'].str.capitalize()
#Use the groupby function to find the # of trips by weather condition and relative proportion
subset_weather_group = df_trips_data.groupby(by='category').agg(trip_count=('start_station_id',"count"))
subset_weather_group['perc_total']=subset_weather_group.trip_count/subset_weather_group.trip_count.sum()*100
fig,axm = plt.subplots(2,1,figsize = (15,15))
sns.set(font_scale=1.5)
#Plot the percentage of total rides occurring during each weather condition including clear days
plt.ylim(0,100)
ax=sns.barplot(ax=axm[0], x=subset_weather_group.index,y=subset_weather_group.perc_total, ci=None)
ax.yaxis.set_major_locator(ticker.MultipleLocator(10))
plt.setp(ax.xaxis.get_majorticklabels(), rotation=90)
#Add labels to each bar
for p in ax.patches:
ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', fontsize=12, color='black', xytext=(0, 20),
textcoords='offset points')
#format sub-titles
ax.set_title("Percentage of Bike Trips as a Function of Weather Condition (Including Clear Days)",fontsize = 18)
ax.set_xlabel("Weather Conditions")
ax.set_ylabel("% of Total Bike Trips")
#Subset the data so we can show a close up of the non-clear days
subset_weather_group=subset_weather_group[~(subset_weather_group.index =='Clear day')]
#Plot the percentage of total rides occurring during each weather condition excluding clear days
plt.ylim(0,5)
ax2=sns.barplot(ax=axm[1], x=subset_weather_group.index,y=subset_weather_group.perc_total, ci=None)
ax2.yaxis.set_major_locator(ticker.MultipleLocator(1))
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=90)
#add labels to each bar
for p in ax2.patches:
ax2.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', fontsize=12, color='black', xytext=(0, 20),
textcoords='offset points')
#format titles
ax2.set_title("Percentage of Bike Trips as a Function of Weather Condition (Excluding Clear Days)",fontsize = 18)
ax2.set_xlabel("Weather Conditions")
ax2.set_ylabel("% of Total Bike Trips")
#Set Plot Title
plt.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle("Figure 30 - Percentage of Bike Trips in Database as Function of Weather Condition")
plt.show()
Moving forward, it will be valuable to distinguish between good weather (clear day) and poor weather (all other weather phenomena). To achieve this, we create a new column in df_trips_data called weather2, where 'Clear day' becomes 'Clear' and all other weather phenomena become 'Precipitation'. Doing so reduces the weather condition to two categories, as shown in Figure 31 - Total Bike Rides from 2017 to 2020 as a Function of Weather Condition.
#Create a new field in which 'Clear day' is labelled 'Clear' and all other weather phenomena 'Precipitation'
df_trips_data['weather2']=df_trips_data['category'].replace(to_replace='Clear day', value=np.nan)
df_trips_data['weather2']=df_trips_data['weather2'].apply(lambda x: 'Clear' if pd.isnull(x) else 'Precipitation')
#Use groupby function to determine relative proportion of each weather condition
tidy=df_trips_data.groupby('weather2').agg(Count=('user_type', "count"))
#Setup plot
fig1 = plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Barplot showing the relative percentage of each weather type: clear or precipitation
ax=sns.barplot(x=tidy.index, y=tidy.Count/len(df_trips_data)*100)
#add labels to each bar
for p in ax.patches:
ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', fontsize=12, color='black', xytext=(0, 10),
textcoords='offset points')
#format sub-titles
plt.ylim(0,100)
plt.tight_layout()
ax.set_title("Figure 31 - Total Bike Rides from 2017 to 2020 as a Function of Weather Condition")
ax.set_xlabel("Weather Condition")
ax.set_ylabel("% of Bike Rides in Database")
plt.show()
del tidy
In Figure 32 - Comparison of Temperature and Relative Humidity for Trips Taken by Annual and Casual Members, the temperature and relative humidity of each trip were plotted on a KDE plot by membership type. This reveals something interesting about rider preferences: annual members tend to be less fussy about temperature and humidity conditions than casual riders when deciding whether to take a ride.
#Set plot font scale (sns.displot creates its own figure, so no plt.figure() is needed here)
sns.set(font_scale=1.2)
#Randomly sample df_trips_data for 80,000 points without replacement
#Data is sampled because it takes too long to create the graph with close to 8 million trips
random_subset = df_trips_data.sample(n=80000)
#Contour plot - relative distribution of relative humidity and temperature for trips taken
contourplot=sns.displot(data=random_subset, x='temp_c',
                        y='rel_hum_', kind='kde', hue='user_type', height=10)
#format plot
contourplot.fig.subplots_adjust(top=0.8)
contourplot.fig.suptitle("Figure 32 - Comparison of Temperature and Relative Humidity for \n Trips Taken by Annual and Casual Members")
contourplot.set_axis_labels(x_var="Temperature (Celsius)", y_var="Relative Humidity (%)")
plt.show()
Figure 33A/B - The Effect of Weather on the Distribution of Trip Duration uses violin plots to examine the influence of weather condition on trip duration. Although trip durations appear to be slightly higher on clear days, on average the weather condition does not seem to significantly impact the duration of trips. Even in bad weather, once a rider has committed to a trip, they do not appear to cut it short or rush to their destination.
#calculate the duration of trip in minutes
df_trips_data['trip_dur_min']=df_trips_data['trip_duration']/60
#Set up plot
from matplotlib import ticker #needed for MultipleLocator used below
fig1 = plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Violin plot of trip duration in minutes by weather category
ax2=sns.violinplot(x='category', y='trip_dur_min', data=df_trips_data, cut=0)
plt.xticks(rotation='vertical')
ax2.yaxis.set_major_locator(ticker.MultipleLocator(5))
#format titles
ax2.set_title("Figure 33A - The Effect of Weather on the Distribution of Trip Duration",fontsize = 18)
ax2.set_xlabel("Weather Condition")
ax2.set_ylabel("Trip Duration (min)")
plt.show()
fig1 = plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
ax2=sns.violinplot(x='weather2', y='trip_dur_min', data=df_trips_data, cut=0)
plt.xticks(rotation='vertical')
ax2.yaxis.set_major_locator(ticker.MultipleLocator(5))
#format titles
ax2.set_title("Figure 33B - The Effect of Weather on the Distribution of Trip Duration",fontsize = 18)
ax2.set_xlabel("Weather Condition")
ax2.set_ylabel("Trip Duration (min)")
plt.show()
The figure below, Figure 34 - The Effect of Temperature on the Distribution of Trip Duration, is a violin plot showing the distribution of trip duration as a function of temperature. It shows that as the temperature increases, a larger proportion of trips are longer.
#create weather bins
bins = np.arange(-25, 40, 5).tolist()
df_trips_data['bin'] = pd.cut(df_trips_data['temp_c'], bins=bins, labels=[f'{l} to {l+5}' for l in range(-25,35,5)])
#Set up plot
fig1 = plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Violin plot of trip duration by temperature bins
ax2=sns.violinplot(x='bin', y='trip_dur_min', data=df_trips_data,cut=0)
plt.xticks(rotation='vertical')
ax2.yaxis.set_major_locator(ticker.MultipleLocator(5))
#format titles
ax2.set_title("Figure 34 The Effect of Temperature on the Distribution of Trip Duration",fontsize = 18)
ax2.set_xlabel("Temperature Range (Celsius)")
ax2.set_ylabel("Trip Duration (min)")
plt.show()
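The temperature binning above uses pd.cut, whose intervals are right-inclusive by default. A quick standalone check of the pattern on toy temperatures (the values are hypothetical, not from the dataset):

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as the notebook cell above
bins = np.arange(-25, 40, 5).tolist()                  # 13 edges -> 12 intervals
labels = [f'{l} to {l+5}' for l in range(-25, 35, 5)]  # one label per interval

# Toy temperatures; pd.cut assigns each value to its (left, right] interval
temps = pd.Series([-21.0, 3.2, 24.9, 25.0])
binned = pd.cut(temps, bins=bins, labels=labels)
```

Note that 25.0 falls in the '20 to 25' bin because the intervals are closed on the right.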
Figure 35 - The Effect of Relative Humidity on the Distribution of Trip Duration looks at the influence of relative humidity on the trip duration. Unlike temperature, the relative humidity does not seem to have a significant impact on the trip duration.
#Break the relative humidity into bins
bins = np.arange(50, 100, 5).tolist()
df_trips_data['bin'] = pd.cut(df_trips_data['rel_hum_'], bins=bins, labels=[f'{l} to {l+5}' for l in range(50,95,5)])
#Set up plot
fig1 = plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Violin plot of trip duration by relative humidity bins
ax2=sns.violinplot(x='bin', y='trip_dur_min', data=df_trips_data,cut=0)
plt.xticks(rotation='vertical')
ax2.yaxis.set_major_locator(ticker.MultipleLocator(5))
#format titles
ax2.set_title("Figure 35 - The Effect of Relative Humidity on the Distribution of Trip Duration",fontsize = 18)
ax2.set_xlabel("Relative Humidity (%)")
ax2.set_ylabel("Trip Duration (min)")
plt.show()
To further investigate the impact of weather conditions on bike usage, we use the groupby.agg() function to compute several hourly aggregates, described in the inline comments of the code below.
This information is saved in the variable called hourly_rides_and_weather.
hourly_rides_and_weather = df_trips_data.groupby(pd.Grouper(key="start_time",
freq='H')).agg(rides=('user_type',"count"), #total number of rides per hour
annual_members=('user_type', lambda x: (x == 'annual member').sum()), #number of annual riders in hour
casual_members=('user_type', lambda x: (x == 'casual member').sum()), #number of casual riders in hour
temp=('temp_c', 'max'), #maximum temperature recorded in the hour
humidity=('rel_hum_', 'max'), #maximum relative humidity recorded
wind=('wind_spd_kmh', 'max'),
weather=('category', lambda x: np.nan if x.isnull().all() else x.value_counts().index[0]), #most commonly occurring weather phenomenon
weather2=('weather2', lambda x: np.nan if x.isnull().all() else x.value_counts().index[0])) #most commonly occurring weather phenomenon, clear or precipitation
hourly_rides_and_weather.head()
| rides | annual_members | casual_members | temp | humidity | wind | weather | weather2 | |
|---|---|---|---|---|---|---|---|---|
| start_time | ||||||||
| 2017-01-01 00:00:00-05:00 | 18 | 16 | 2 | 1.5 | 69.0 | 39.0 | Clear day | Clear |
| 2017-01-01 01:00:00-05:00 | 13 | 13 | 0 | 1.5 | 68.0 | 35.0 | Clear day | Clear |
| 2017-01-01 02:00:00-05:00 | 15 | 15 | 0 | 1.2 | 68.0 | 37.0 | Clear day | Clear |
| 2017-01-01 03:00:00-05:00 | 10 | 8 | 2 | 1.3 | 67.0 | 37.0 | Clear day | Clear |
| 2017-01-01 04:00:00-05:00 | 5 | 5 | 0 | 1.3 | 69.0 | 30.0 | Clear day | Clear |
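The hourly aggregation relies on pd.Grouper to bin trips into calendar hours. A minimal self-contained sketch of the same pattern on toy data (not the project dataset):

```python
import pandas as pd

# Toy trip log: one row per trip, mirroring df_trips_data in miniature
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2017-01-01 00:05", "2017-01-01 00:40",
        "2017-01-01 01:10", "2017-01-01 02:59",
    ]),
    "user_type": ["annual member", "casual member",
                  "annual member", "annual member"],
    "temp_c": [1.5, 1.4, 1.2, 1.3],
})

# Bin rows into hourly buckets keyed on start_time ('h' = calendar hour)
hourly = trips.groupby(pd.Grouper(key="start_time", freq="h")).agg(
    rides=("user_type", "count"),
    annual_members=("user_type", lambda x: (x == "annual member").sum()),
    temp=("temp_c", "max"),
)
```

Each output row is one clock hour; hours with no trips would appear with a ride count of zero.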
Figure 36 - The Hourly Ride Count as a Function of Weather Condition shows the distribution of the hourly ride count as a function of weather condition and membership type. On average, there are significantly more riders per hour when the weather is clear. Nevertheless, the boxplots show that there are significant fluctuations, and hence outliers, in the data. Some outliers have been cut from the plot to show the box plot at an appropriate scale.
#Set up plot
fig1 = plt.figure(figsize=(10,8))
sns.set(font_scale=1.5)
#Boxplot for hourly ride count as function of weather
ax=sns.boxplot(x="weather2", y="rides",data=hourly_rides_and_weather)
#Format plot
ax.set_title("Figure 36 - The Hourly Ride Count as a Function of Weather Condition")
ax.set_xlabel("Weather Condition")
ax.set_ylabel("Hourly Ride Count")
ax.set_ylim(0, 1500)
plt.show()
Figure 37 - Effect of Wind Speed on Hourly Ride Counts shows the effect of wind speed on the hourly ride counts. There is a strong negative correlation between wind speed and ride counts for both annual and casual members.
plt.figure(figsize=(10,5))
hum=sns.scatterplot(x=hourly_rides_and_weather['wind'],
y=hourly_rides_and_weather['annual_members'],
label="Annual Members")
sns.scatterplot(x=hourly_rides_and_weather['wind'],
y=hourly_rides_and_weather['casual_members'],
label="Casual Members")
hum.axes.set_title("Figure 37 - Effect of Wind Speed on Hourly Ride Counts",
fontsize=16)
hum.set_xlabel("Maximum Wind Speed (kmh)", fontsize=16)
#ride_scatter.set_ylim(0, 0.00035)
#ride_count.set_xlim(0, 14000)
hum.set_ylabel("Hourly Ride Counts",
fontsize=16)
hum.legend(fontsize=14)
plt.show()
To investigate the impact of weather conditions on daily bike usage, we again use the groupby.agg() function, with the aggregates described in the inline comments of the code below.
This information is saved in the variable called daily_rides_and_weathers.
#Replace all 'Clear' with NaN so that series.count() in weatherclass counts only precipitation hours
hourly_rides_and_weather['weather2']=hourly_rides_and_weather['weather2'].replace(to_replace='Clear', value=np.nan)
#Assign the daily weather based on the dominant (>50% of hours) weather condition
def weatherclass(series):
    #series.count() counts non-null entries only, i.e. hours with precipitation
    if series.count() > 0.5*len(series):
        x = 'Precipitation'
    else:
        x = 'Clear'
    return x
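The >50% dominance rule can be sanity-checked on toy hourly series, where NaN represents a clear hour and any string a precipitation hour. Below is a standalone re-implementation of the same rule (the hour counts are hypothetical):

```python
import numpy as np
import pandas as pd

def weatherclass(series):
    # series.count() counts non-null entries only, i.e. precipitation hours
    return 'Precipitation' if series.count() > 0.5 * len(series) else 'Clear'

wet_day = pd.Series(['Precipitation'] * 13 + [np.nan] * 11)  # 13 of 24 hours wet
dry_day = pd.Series(['Precipitation'] * 11 + [np.nan] * 13)  # 11 of 24 hours wet
```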
#Use the groupby function to determine daily rides and associated weather condition
daily_rides_and_weathers = hourly_rides_and_weather.groupby(hourly_rides_and_weather.index.floor('D')).agg(rides=('rides','sum'),
annual_members=('annual_members','sum'),
casual_members=('casual_members', 'sum'),
temp=('temp', 'max'),
humidity=('humidity', 'max'),
weather=('weather2', weatherclass))
#Create new column recording whether each day is a weekday or weekend
daily_rides_and_weathers['isweek_day'] = np.where(daily_rides_and_weathers.index.weekday < 5, 'weekday', 'weekend')
daily_rides_and_weathers.head()
| rides | annual_members | casual_members | temp | humidity | weather | isweek_day | |
|---|---|---|---|---|---|---|---|
| start_time | |||||||
| 2017-01-01 00:00:00-05:00 | 482 | 412 | 70 | 3.0 | 77.0 | Clear | weekend |
| 2017-01-02 00:00:00-05:00 | 826 | 756 | 70 | 4.7 | 95.0 | Clear | weekday |
| 2017-01-03 00:00:00-05:00 | 871 | 853 | 18 | 5.1 | 98.0 | Precipitation | weekday |
| 2017-01-04 00:00:00-05:00 | 1395 | 1361 | 34 | 3.9 | 94.0 | Clear | weekday |
| 2017-01-05 00:00:00-05:00 | 1210 | 1191 | 19 | -5.6 | 72.0 | Clear | weekday |
Figure 38 - Effect of Weather Condition on Daily Ride Counts shows the distribution of daily ride counts by weather condition. Daily ride counts are typically very low on days dominated by poor weather (more than 50% of hours with precipitation); on clear days (or days that are more than 50% clear), daily ride counts are much higher.
plt.figure(figsize=(8,5))
weather=sns.violinplot(x=daily_rides_and_weathers['weather'],
y=daily_rides_and_weathers['rides'],cut=0)
weather.set_ylabel("Daily Ride Counts", fontsize=16)
weather.set_xlabel("Weather Conditions", fontsize=16)
weather.set_title("Figure 38 - Effect of Weather Condition on Daily Ride Counts",
fontsize=16)
plt.show()
Figure 39 - Effect of Temperature on Daily Ride Counts shows the daily ride counts as a function of temperature. For both annual and casual members, daily ride counts tend to increase with temperature, with an apparent peak around 25-27 degrees Celsius. It is interesting to note that riders using the bike share below 0 degrees are predominantly annual members.
plt.figure(figsize=(10,5))
temp=sns.scatterplot(x=daily_rides_and_weathers['temp'],
y=daily_rides_and_weathers['annual_members'],
label="Annual Members")
sns.scatterplot(x=daily_rides_and_weathers['temp'],
y=daily_rides_and_weathers['casual_members'],
label="Casual Members")
temp.axes.set_title("Figure 39 - Effect of Temperature on Daily Ride Counts",
fontsize=16)
temp.set_xlabel("Maximum Daily Temperature (Celsius)", fontsize=16)
#ride_scatter.set_ylim(0, 0.00035)
#ride_count.set_xlim(0, 14000)
temp.set_ylabel("Daily Ride Counts",
fontsize=16)
temp.legend(fontsize=14)
plt.show()
Figure 40 - Effect of Relative Humidity on Daily Ride Counts shows the daily ride counts as a function of relative humidity. The correlation between daily ride count and relative humidity is much weaker than that observed for temperature, although there is a slight positive correlation for both annual and casual members.
plt.figure(figsize=(10,5))
hum=sns.scatterplot(x=daily_rides_and_weathers['humidity'],
y=daily_rides_and_weathers['annual_members'],
label="Annual Members")
sns.scatterplot(x=daily_rides_and_weathers['humidity'],
y=daily_rides_and_weathers['casual_members'],
label="Casual Members")
hum.axes.set_title("Figure 40 - Effect of Relative Humidity on Daily Ride Counts",
fontsize=16)
hum.set_xlabel("Maximum Daily Relative Humidity (%)", fontsize=16)
#ride_scatter.set_ylim(0, 0.00035)
#ride_count.set_xlim(0, 14000)
hum.set_ylabel("Daily Ride Counts",
fontsize=16)
hum.legend(fontsize=14)
plt.show()
Figure 41 - The Daily Ride Count as a Function of Member Type and Weather Condition shows the distribution of the daily ride count as a function of weather condition and membership type. The graph demonstrates that weather condition is a very strong predictor of the daily ride count for both annual and casual riders.
#Create intermediate dataframe created using melt to produce a "long-form" dataframe.
tidy = pd.melt(daily_rides_and_weathers, id_vars=['weather'], value_vars=['casual_members','annual_members'], var_name='member_type')
#Set up plot
fig1 = plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
#Boxplot for daily ride count as function of weather and membership type
ax=sns.boxplot(x="weather", y="value", data=tidy, hue="member_type")
#Format plot
ax.set_title("Figure 41 - The Daily Ride Count as a Function of Member Type and Weather Condition")
ax.set_xlabel("Weather Condition")
ax.set_ylabel("Daily Ride Count")
plt.show()
del tidy
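The melt step above converts one row per day into one row per (day, member type) pair so that seaborn can split on hue. A toy sketch of the reshape (counts are hypothetical):

```python
import pandas as pd

# Toy wide table: one row per observation, one column per member type
wide = pd.DataFrame({
    "weather": ["Clear", "Precipitation"],
    "casual_members": [120, 30],
    "annual_members": [400, 250],
})

# Long form: one row per (weather, member_type) combination, counts in 'value'
tidy = pd.melt(wide, id_vars=["weather"],
               value_vars=["casual_members", "annual_members"],
               var_name="member_type")
```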
From the above analyses, we determined that wind speed, temperature and precipitation (gauged through the weather condition) are the strongest predictors of bike usage.
Below, we have conducted a bivariate KDE analysis of temperature and wind speed. It shows that while annual members are less sensitive than casual members to temperature conditions, both member types are similar in their preference to avoid riding at high wind speeds.
#Set up plot
fig1 = plt.figure(figsize=(15,15))
sns.set(font_scale=1.2)
#Randomly sample the df_trips_data for 80,000 points without replacement
#Data is sampled because it takes too long to create the graph with close to 8 million trips
random_subset = df_trips_data.sample(n=80000)
#Contour plot -relative distribution of relative humidity and temperature for trips taken
contourplot=sns.displot(data=random_subset , x='temp_c',
y='wind_spd_kmh', kind='kde', hue='user_type')
#format plot
contourplot.fig.subplots_adjust(top=0.8)
contourplot.fig.suptitle("Figure 41B - Comparison of Temperature and Wind Speed for \n Trips Taken by Annual and Casual Members")
contourplot.set_axis_labels(x_var="Temperature (Celsius)", y_var="Wind Speed (kmh)")
plt.show()
As part of this question, we will first determine which bike stations are within 200 m of a subway station. This information will then be used to split the data into two groups: trips that start or end near a subway station, and trips that neither start nor end near a subway station.
We will then examine the differences in hourly and weekly usage between these two groups of trips.
We had hoped to incorporate proximity to streetcar routes as well, but a shapefile of streetcar routes was not available.
#import the subway station locations
subway_stations = gpd.read_file("subway_stations.shp").to_crs(epsg=26917)
subway_stations.head()
| STATION | LINE | PLATFORM_L | AVG_PASSEN | LINE2 | PLATFORM_1 | SUBWAY_TRA | ADDRESS | Opened | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kipling | Bloor-Danforth | 1 | 53640 | None | None | False | 5247 Dundas St. West | 1980 | POINT (618101.613 4832636.300) |
| 1 | Islington | Bloor-Danforth | 1 | 43090 | None | None | False | 3286 Bloor St. West | 1968 | POINT (618990.613 4833544.113) |
| 2 | Royal York | Bloor-Danforth | 2 | 19440 | None | None | False | 3012 Bloor St. West | 1968 | POINT (620056.496 4833882.764) |
| 3 | Old Mill | Bloor-Danforth | 2 | 5780 | None | None | False | 2672 Bloor St. West | 1968 | POINT (621361.678 4834111.901) |
| 4 | Jane | Bloor-Danforth | 2 | 16730 | None | None | False | 2440 Bloor St. West | 1968 | POINT (622220.664 4834091.381) |
We want to see which bike stations are within 200 metres of a subway station, and the simplest way to do this is to create a buffer. The 'geometry' column of a GeoDataFrame has a method called .buffer(), which takes a radius argument in the units of the CRS (metres in the case of EPSG:26917).
Below, we create a new variable called subway_stations_buffer, set equal to subway_stations with a 200 metre buffer applied.
#Create a 200m buffer around each subway station
subway_stations_buffer = subway_stations.buffer(200)
subway_stations_buffer.head()
0 POLYGON ((618301.613 4832636.300, 618300.650 4... 1 POLYGON ((619190.613 4833544.113, 619189.650 4... 2 POLYGON ((620256.496 4833882.764, 620255.533 4... 3 POLYGON ((621561.678 4834111.901, 621560.714 4... 4 POLYGON ((622420.664 4834091.381, 622419.701 4... dtype: geometry
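For point geometries, buffering the subway stations by 200 m and testing containment is equivalent to a plain Euclidean distance test in the projected CRS (metres). A minimal sketch of that equivalence with hypothetical coordinates:

```python
import math

# Hypothetical projected coordinates in metres (easting, northing), e.g. EPSG:26917
subway = (618101.6, 4832636.3)
bike_a = (618250.0, 4832700.0)   # roughly 160 m from the subway station
bike_b = (619000.0, 4833000.0)   # roughly 970 m from the subway station

def within_buffer(pt, centre, radius=200.0):
    # Point-in-circular-buffer reduces to a distance comparison
    return math.dist(pt, centre) <= radius
```

This is why the buffer radius must be given in the units of the CRS: in EPSG:4326 the same test would be in degrees, not metres.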
# Create map showing the subway station buffers
map_8 = folium.Map(location=[43.67422832658063, -79.39941688479091],
tiles='cartodbpositron',
zoom_start=11)
# Plot each buffer polygon on the map
folium.GeoJson(subway_stations_buffer.to_crs(epsg=4326)).add_to(map_8)
# Show the map
map_8
Now we want to test whether a bike station is within 200 metres of a subway station. First we collapse all of the subway station buffer POLYGONs into a single MULTIPOLYGON object using the unary_union attribute.
#collapse all of the buffer `POLYGON` into a `MULTIPOLYGON`
subway_stations_union = subway_stations_buffer.geometry.unary_union
#convert it to a GeoDataFrame
subway_stations_union = gpd.GeoDataFrame(geometry=[subway_stations_union], crs='EPSG:26917')
subway_stations_union.head()
| geometry | |
|---|---|
| 0 | MULTIPOLYGON (((618301.613 4832636.300, 618300... |
We create a new column in df_stations called 'subway_access' and assign it a boolean value: True if the bike station is within 200 metres of a subway station and False if it is not.
df_stations['subway_access'] = df_stations.to_crs(epsg=26917).apply(lambda row: subway_stations_union.contains(row.geometry).any(), axis=1)
df_stations.head()
| station_id | station_name | lat | lon | capacity | geometry | start_ride_count | end_ride_count | subway_access | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7000 | Fort York Blvd / Capreol Ct | 43.639832 | -79.395954 | 35 | POINT (-79.39595 43.63983) | 53997.0 | 50623.0 | False |
| 1 | 7001 | Lower Jarvis St / The Esplanade | 43.647830 | -79.370698 | 15 | POINT (-79.37070 43.64783) | 28720.0 | 34653.0 | False |
| 2 | 7002 | St. George St / Bloor St W | 43.667333 | -79.399429 | 19 | POINT (-79.39943 43.66733) | 42313.0 | 38205.0 | True |
| 3 | 7003 | Madison Ave / Bloor St W | 43.667158 | -79.402761 | 15 | POINT (-79.40276 43.66716) | 26485.0 | 22875.0 | True |
| 4 | 7004 | University Ave / Elm St | 43.656518 | -79.389099 | 11 | POINT (-79.38910 43.65652) | 23373.0 | 22723.0 | True |
# Create map of Toronto
map_9 = folium.Map(location=[43.6615874,-79.3808117],
tiles='cartodbpositron',
zoom_start=12)
# Plot subway polygon on the map with 200m buffer
folium.GeoJson(subway_stations_union.to_crs(epsg=4326)).add_to(map_9)
# Add bike station points to the map, coloured by subway access
for idx, row in df_stations.to_crs(epsg=4326).iterrows():
    if row['subway_access']==True:
        folium.Marker([row.geometry.y, row.geometry.x],
                      icon=folium.Icon(color='green')).add_to(map_9)
    else:
        folium.Marker([row.geometry.y, row.geometry.x],
                      icon=folium.Icon(color='red')).add_to(map_9)
#Show the map
map_9
#Subset bike station dataframe for only ones within 200m of a subway station
near_station = df_stations[df_stations['subway_access']==True]
far_station = df_stations[df_stations['subway_access']==False]
#There are 102 stations near the subway
print("Number of bike stations closer than 200m from subway:", len(near_station.station_id.unique()))
#There are 508 stations far from the subway
print("Number of bike stations more than 200m away from subway:", len(far_station.station_id.unique()))
Number of bike stations closer than 200m from subway: 102 Number of bike stations more than 200m away from subway: 508
We create two variables: near_station_trips contains trips with a start or end station near a subway station, while far_station_trips contains trips where both the start and end stations are far from a subway station.
#Filter for trips that start or end at a bike station located near a subway station
#.copy() avoids SettingWithCopyWarning when adding columns below
near_station_trips = df_trips_data[(df_trips_data['start_station_id'].isin(near_station['station_id'])) | (df_trips_data['end_station_id'].isin(near_station['station_id']))].copy()
near_station_trips['dayofweek'] = near_station_trips.start_time.dt.dayofweek
near_station_trips['isweek_day'] = near_station_trips.start_time.dt.weekday <5
#Filter for trips that start and end at bike stations located more than 200m away from a subway station
far_station_trips = df_trips_data[(df_trips_data['start_station_id'].isin(far_station['station_id'])) & (df_trips_data['end_station_id'].isin(far_station['station_id']))].copy()
far_station_trips['dayofweek'] = far_station_trips.start_time.dt.dayofweek
far_station_trips['isweek_day'] = far_station_trips.start_time.dt.weekday <5
#Create new variable with day of week for each
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
near_station_trips['weekday'] = near_station_trips.dayofweek.map(dayOfWeek)
far_station_trips['weekday'] = far_station_trips.dayofweek.map(dayOfWeek)
#create column to distinguish two variables
far_station_trips['loc']='far'
near_station_trips['loc']='near'
combined_station=pd.concat([far_station_trips,near_station_trips])
combined_station.head()
| subscription_id | trip_duration | start_station_id | start_time | start_station_name | end_station_id | end_time | end_station_name | bike_id | user_type | ... | weekday | hour | is_holiday | isweek_day | category | weather2 | trip_dur_min | bin | dayofweek | loc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trip_id | |||||||||||||||||||||
| 712441 | NaN | 274 | 7006 | 2017-01-01 00:03:00-05:00 | Bay St / College St (East Side) | 7021 | 2017-01-01 00:08:00-05:00 | Bay St / Albert St | NaN | annual member | ... | Sunday | 0.050000 | True | False | Clear day | Clear | 4.566667 | 65 to 70 | 6 | far |
| 712442 | NaN | 538 | 7046 | 2017-01-01 00:03:00-05:00 | Niagara St / Richmond St W | 7147 | 2017-01-01 00:12:00-05:00 | King St W / Fraser Ave | NaN | annual member | ... | Sunday | 0.050000 | True | False | Clear day | Clear | 8.966667 | 65 to 70 | 6 | far |
| 712444 | NaN | 1005 | 7177 | 2017-01-01 00:09:00-05:00 | East Liberty St / Pirandello St | 7202 | 2017-01-01 00:26:00-05:00 | Queen St W / York St (City Hall) | NaN | annual member | ... | Sunday | 0.150000 | True | False | Clear day | Clear | 16.750000 | 65 to 70 | 6 | far |
| 712445 | NaN | 645 | 7203 | 2017-01-01 00:14:00-05:00 | Bathurst St / Queens Quay W | 7010 | 2017-01-01 00:25:00-05:00 | King St W / Spadina Ave | NaN | annual member | ... | Sunday | 0.233333 | True | False | Clear day | Clear | 10.750000 | 65 to 70 | 6 | far |
| 712446 | NaN | 660 | 7193 | 2017-01-01 00:15:00-05:00 | Queen St W / Gladstone Ave | 7123 | 2017-01-01 00:26:00-05:00 | 424 Wellington St. W | NaN | annual member | ... | Sunday | 0.250000 | True | False | Clear day | Clear | 11.000000 | 65 to 70 | 6 | far |
5 rows × 41 columns
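Note the asymmetry in the two filters: a trip counts as "near" if either endpoint is near a subway station (`|`), but as "far" only if both endpoints are far (`&`). Since every station is classified as either near or far, the membership tests above are equivalent to the negation form sketched here with hypothetical station ids:

```python
import pandas as pd

near_ids = {7002, 7003}          # hypothetical stations within 200 m of a subway
trips = pd.DataFrame({
    "start_station_id": [7002, 7000, 7000],
    "end_station_id":   [7000, 7003, 7001],
})

# Either endpoint near -> 'near' trip
near_mask = (trips["start_station_id"].isin(near_ids)
             | trips["end_station_id"].isin(near_ids))
# Both endpoints far -> 'far' trip (De Morgan complement of near_mask)
far_mask = (~trips["start_station_id"].isin(near_ids)
            & ~trips["end_station_id"].isin(near_ids))
```

Because the two masks are complementary, every trip falls into exactly one of the two groups.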
Analysis of the hourly usage shown in Figure 42 - Hourly Distribution of Ride Counts for Bike Trips Far and Near Subway Station does not reveal any difference in hourly usage between trips with at least one endpoint near a subway station and trips with neither endpoint near a subway station.
#Set up plot
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
sns.set(font_scale=1.0)
ax2.set_xlim(0,24)
#Plot hourly bike usage for trips near the subway and far from the subway
sns.distplot(near_station_trips[near_station_trips['user_type']=='annual member']['hour'], label = "Near Station, Annual", ax = ax2, hist = False)
sns.distplot(near_station_trips[near_station_trips['user_type']=='casual member']['hour'], label = "Near Station, Casual", ax = ax2, hist = False)
sns.distplot(far_station_trips[far_station_trips['user_type']=='annual member']['hour'], label = "Far Station, Annual", ax = ax2, hist = False)
sns.distplot(far_station_trips[far_station_trips['user_type']=='casual member']['hour'], label = "Far Station, Casual", ax = ax2, hist = False)
#format titles
ax2.set_title("Figure 42 - Hourly Distribution of Ride Counts for Bike Trips Far and Near Subway Station")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Probability Density")
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend()
plt.show()
#Delete to create more memory
del near_station_trips
del far_station_trips
The analysis of bike usage by day of week, Figure 43 - Ride Counts by Day of Week for Bike Trips Far and Near Subway Station, shows that usage is more common on weekdays for trips near the subway, but the same trend is not observed for bike trips that are not near subway stations.
#Set up plot
plt.figure(figsize=(12,8))
sns.set(font_scale=1.2)
#plot the graph for daily trip count for all, casual and annual members
ax = sns.countplot(data=combined_station, y="weekday", hue="loc",
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 43 - Ride Counts by Day of Week for Bike Trips Far and Near Subway Station")
ax.set_xlabel("Ride Count")
ax.set_ylabel("Day of Week")
ax.legend(title="Trip Proximity to Subway Station")
plt.show()
We use the groupby.agg() function to determine the daily ride counts for trips near and far from subway stations. Figure 44 - Average Daily Rides (with 95% Confidence Interval) for Day of Week and Trip Proximity to Subway Station shows the same trend as the total counts by day of week (Figure 43): for trips near a subway station, average rides are higher during the week. This is not the case for trips far from subway stations.
#groupby function to find daily ride count for trips near and far from the subway
daily_rides_station = combined_station.groupby(pd.Grouper(key="start_time",freq='D')).agg(far_rides=('loc', lambda x: (x == 'far').sum()),
near_rides=('loc', lambda x: (x == 'near').sum()),
annual_members=('user_type', lambda x: (x == 'annual member').sum()),
casual_members=('user_type', lambda x: (x == 'casual member').sum()))
#add extra columns to determine day of week
daily_rides_station['dayofweek'] = daily_rides_station.index.dayofweek
daily_rides_station['isweek_day'] = daily_rides_station.index.weekday <5
daily_rides_station['weekday'] = daily_rides_station.dayofweek.map(dayOfWeek)
daily_rides_station.head()
| far_rides | near_rides | annual_members | casual_members | dayofweek | isweek_day | weekday | |
|---|---|---|---|---|---|---|---|
| start_time | |||||||
| 2017-01-01 00:00:00-05:00 | 320 | 162 | 412 | 70 | 6 | False | Sunday |
| 2017-01-02 00:00:00-05:00 | 506 | 320 | 756 | 70 | 0 | True | Monday |
| 2017-01-03 00:00:00-05:00 | 538 | 333 | 853 | 18 | 1 | True | Tuesday |
| 2017-01-04 00:00:00-05:00 | 809 | 586 | 1361 | 34 | 2 | True | Wednesday |
| 2017-01-05 00:00:00-05:00 | 688 | 522 | 1191 | 19 | 3 | True | Thursday |
tidy = pd.melt(daily_rides_station, id_vars=['weekday'], value_vars=['far_rides','near_rides'], var_name='proximity')
plt.figure(figsize=(15,10))
sns.set(font_scale=1.5)
ax=sns.barplot(data= tidy, x='weekday',y='value',hue='proximity',
order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.axes.set_title("Figure 44 - Average Daily Rides (with 95% Confidence Interval) for Day of Week \n and Trip Proximity to Subway Station")
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_ylabel("Average Daily Rides")
ax.set_xlabel("Day of Week")
ax.legend(title='Proximity to Subway')
plt.show()
del tidy
The annual trend in bike usage for routes far from and near the subway is interesting. Figure 45 - Daily Ride Counts between 2017 to 2020 for Bike Routes Near and Far from Subway Station shows that the number of bike rides far from the subway appears to be increasing more rapidly over the years than rides near the subway. This is more clearly visible in the monthly ride counts shown in Figure 46 - Monthly Ride Counts between 2017 to 2020 for Bike Routes Near and Far from Subway Station.
#Set up Plot
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter, MonthLocator
from matplotlib.ticker import FuncFormatter
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Use lineplot to plot the daily ride count over time (2017 - 2020)
ax=sns.lineplot(x=daily_rides_station.index, y=daily_rides_station.near_rides, label="near subway")
ax=sns.lineplot(x=daily_rides_station.index, y=daily_rides_station.far_rides, label="far from subway")
ax.axes.set_title("Figure 45 - Daily Ride Counts between 2017 to 2020 \n for Bike Routes Near and Far from Subway Station",
fontsize=16)
ax.set_ylabel("Daily Ride Counts")
ax.set_xlabel("Date")
ax.legend(title="Proximity to Subway Station")
#format axis
# Minor ticks every month.
fmt_month = mdates.MonthLocator(interval=1)
#define function to return first letter of every month
month_fmt = DateFormatter('%b')
def m_fmt(x, pos=None):
    return month_fmt(x)[0]
ax.xaxis.set_minor_locator(MonthLocator())
ax.xaxis.set_minor_formatter(FuncFormatter(m_fmt))
# Major ticks every year
years = mdates.YearLocator()
#ax.xaxis.set_major_locator(years)
yearsFmt = mdates.DateFormatter('\n\n%Y') # add some space for the year label
ax.xaxis.set_major_formatter(yearsFmt)
plt.show()
#groupby function to find monthly ride count for trips near and far from the subway
monthly_rides_station = combined_station.groupby(pd.Grouper(key="start_time",freq='M')).agg(far_rides=('loc', lambda x: (x == 'far').sum()),
near_rides=('loc', lambda x: (x == 'near').sum()),
annual_members=('user_type', lambda x: (x == 'annual member').sum()),
casual_members=('user_type', lambda x: (x == 'casual member').sum()))
#add extra columns to determine day of week
monthly_rides_station['dayofweek'] = monthly_rides_station.index.dayofweek
monthly_rides_station['isweek_day'] = monthly_rides_station.index.weekday <5
monthly_rides_station['weekday'] = monthly_rides_station.dayofweek.map(dayOfWeek)
monthly_rides_station.head()
| far_rides | near_rides | annual_members | casual_members | dayofweek | isweek_day | weekday | |
|---|---|---|---|---|---|---|---|
| start_time | |||||||
| 2017-01-31 00:00:00-05:00 | 24219 | 16939 | 39866 | 1292 | 1 | True | Tuesday |
| 2017-02-28 00:00:00-05:00 | 24350 | 16712 | 38852 | 2210 | 1 | True | Tuesday |
| 2017-03-31 00:00:00-04:00 | 28166 | 20324 | 46482 | 2008 | 4 | True | Friday |
| 2017-04-30 00:00:00-04:00 | 46776 | 28433 | 64080 | 11129 | 6 | False | Sunday |
| 2017-05-31 00:00:00-04:00 | 62137 | 38528 | 84231 | 16434 | 2 | True | Wednesday |
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Use lineplot to plot the monthly ride count over time (2017 - 2020)
ax=sns.lineplot(x=monthly_rides_station.index, y=monthly_rides_station.near_rides, label="near subway")
ax=sns.lineplot(x=monthly_rides_station.index, y=monthly_rides_station.far_rides, label="far from subway")
ax.axes.set_title("Figure 46 - Monthly Ride Counts between 2017 to 2020 \n for Bike Routes Near and Far from Subway Station",
fontsize=16)
ax.set_ylabel("Monthly Ride Counts")
ax.set_xlabel("Date")
ax.legend(title="Proximity to Subway Station")
#format axis
# Minor ticks every month.
fmt_month = mdates.MonthLocator(interval=1)
#define function to return first letter of every month
month_fmt = DateFormatter('%b')
def m_fmt(x, pos=None):
    return month_fmt(x)[0]
ax.xaxis.set_minor_locator(MonthLocator())
ax.xaxis.set_minor_formatter(FuncFormatter(m_fmt))
# Major ticks every year
years = mdates.YearLocator()
#ax.xaxis.set_major_locator(years)
yearsFmt = mdates.DateFormatter('\n\n%Y') # add some space for the year label
ax.xaxis.set_major_formatter(yearsFmt)
plt.show()
The City of Toronto publishes data about its current Bikeway Network, which we import from the bike lane file 'bikeway_network.shp'. As part of this analysis, we will determine which bike paths are most congested, assuming everyone travels along bike paths. Once the most congested paths are identified, we will look at the hourly usage along those routes to determine the times of day when the lanes are busiest.
For this analysis we will use network analysis via the networkx Python module. Network analysis of course requires a network to analyse. The OSMnx package, explored in a previous tutorial, makes it easy to retrieve routable networks from OpenStreetMap for different transport modes (walking, cycling, and driving), and it combines functionality from networkx to make routing along OpenStreetMap data straightforward.
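At its core, routing over a street network is a weighted shortest-path computation (networkx's shortest_path with weight='length' does essentially this on the OSMnx graph). Below is a minimal pure-Python Dijkstra sketch over a toy intersection graph; the node names and edge lengths are hypothetical:

```python
import heapq

# Toy road graph: node -> list of (neighbour, edge length in metres)
graph = {
    "A": [("B", 100.0), ("C", 250.0)],
    "B": [("A", 100.0), ("C", 100.0), ("D", 300.0)],
    "C": [("A", 250.0), ("B", 100.0), ("D", 120.0)],
    "D": [("B", 300.0), ("C", 120.0)],
}

def shortest_path(graph, source, target):
    # Classic Dijkstra: repeatedly pop the closest frontier node and relax its edges
    pq = [(0.0, source, [source])]
    seen = set()
    while pq:
        dist, node, path = heapq.heappop(pq)
        if node == target:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, length in graph[node]:
            if nbr not in seen:
                heapq.heappush(pq, (dist + length, nbr, path + [nbr]))
    return float("inf"), []
```

Here the cheapest A-to-D route detours through B and C (100 + 100 + 120 m) rather than taking the direct but longer edges; route assignment over the real bike lane network works the same way at a much larger scale.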
#Import bike lane shape file
bike_lanes = gpd.read_file('bikeway_network.shp')
#Subset for the columns that we need
bike_lanes = bike_lanes[['LF_NAME', 'SEG_TYPE', 'length', 'geometry']]
#Rename columns
bike_lanes = bike_lanes.rename(columns={'LF_NAME': 'name', 'SEG_TYPE': 'route_type'})
#Filter the data for just the bike lanes
bike_lanes = bike_lanes[bike_lanes['route_type'] == 'bike lane']
bike_lanes.head()
| | name | route_type | length | geometry |
|---|---|---|---|---|
| 17 | STEELES AVE E | bike lane | 615.29501 | LINESTRING (-8814007.269 5442838.691, -8813937... |
| 18 | STEELES AVE E | bike lane | 207.34282 | LINESTRING (-8814279.821 5442748.151, -8814007... |
| 19 | STEELES AVE E | bike lane | 824.67887 | LINESTRING (-8815361.211 5442380.194, -8814988... |
| 20 | STEELES AVE E | bike lane | 815.29953 | LINESTRING (-8816433.519 5442026.400, -8816384... |
| 21 | STEELES AVE E | bike lane | 63.72301 | LINESTRING (-8816516.981 5441997.707, -8816472... |
#Bike Lane CRS is EPSG 3857, convert to 26917
bike_lanes = bike_lanes.to_crs(epsg=26917)
bike_lanes.crs
<Projected CRS: EPSG:26917> Name: NAD83 / UTM zone 17N Axis Info [cartesian]: - E[east]: Easting (metre) - N[north]: Northing (metre) Area of Use: - name: North America - 84°W to 78°W and NAD83 by country - bounds: (-84.0, 23.81, -78.0, 84.0) Coordinate Operation: - name: UTM zone 17N - method: Transverse Mercator Datum: North American Datum 1983 - Ellipsoid: GRS 1980 - Prime Meridian: Greenwich
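The point of this reprojection is that EPSG:26917 (NAD83 / UTM zone 17N) uses metres, so buffers and lengths computed in it are real distances, whereas EPSG:3857 is noticeably distorted at Toronto's latitude. As a quick hedged sketch, the transform can be checked directly with pyproj; the coordinates below are an assumed point near downtown Toronto, not taken from the dataset:

```python
from pyproj import Transformer

# Assumed point near downtown Toronto (lat, lon in EPSG:4326)
lat, lon = 43.6426, -79.3871

# always_xy=True means the transformer expects (lon, lat) input order
to_utm = Transformer.from_crs("EPSG:4326", "EPSG:26917", always_xy=True)
easting, northing = to_utm.transform(lon, lat)

# UTM zone 17N coordinates are in metres, matching the LINESTRING
# coordinates seen in the reprojected bike_lanes table
print(round(easting), round(northing))
```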
#Groupby start and end station id, identify the most common route in database
route_bike=df_trips_data.groupby(["start_station_id", "end_station_id"]).size()
#Reset index
route_bike=pd.DataFrame(route_bike)
route_bike.reset_index(inplace=True)
#Change count column name from 0 to count
route_bike.rename(columns={0: "count"}, inplace=True)
#Remove from analysis if the start and end location is the same, cannot determine route
route_bike = route_bike[route_bike['start_station_id'] != route_bike['end_station_id']]
#Top 10 most common start and end destinations
route_bike=route_bike.sort_values(by="count", ascending=False)
route_bike.reset_index(drop=True, inplace=True)
route_bike_top10=route_bike.head(10)
route_bike_top10
| | start_station_id | end_station_id | count |
|---|---|---|---|
| 0 | 7059 | 7033 | 5939 |
| 1 | 7203 | 7076 | 5500 |
| 2 | 7076 | 7203 | 4792 |
| 3 | 7344 | 7354 | 4066 |
| 4 | 7171 | 7242 | 3891 |
| 5 | 7242 | 7222 | 3808 |
| 6 | 7175 | 7171 | 3608 |
| 7 | 7354 | 7344 | 3434 |
| 8 | 7051 | 7042 | 3370 |
| 9 | 7220 | 7288 | 3307 |
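The groupby/size pattern above can be illustrated on a toy trip table (the station ids below are invented); note how round trips with the same start and end station are dropped, since no route can be inferred for them. `reset_index(name="count")` is a slightly tidier alternative to the DataFrame-plus-rename used above:

```python
import pandas as pd

# Toy trip table; station ids are invented for illustration
trips = pd.DataFrame({
    "start_station_id": [1, 1, 2, 3, 3, 3],
    "end_station_id":   [2, 2, 1, 3, 1, 1],
})

# Count occurrences of each (start, end) pair
counts = trips.groupby(["start_station_id", "end_station_id"]).size()
counts = counts.reset_index(name="count")

# Drop round trips - the route cannot be determined
counts = counts[counts["start_station_id"] != counts["end_station_id"]]
counts = counts.sort_values("count", ascending=False).reset_index(drop=True)
print(counts)
```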
#Make a list of unique station ids in the common routes
unique_stations=np.unique(route_bike_top10[["start_station_id", "end_station_id"]].values)
unique_stations
#Subset the bike station data to only contain station ids found in top routes
route_stations =df_stations[df_stations['station_id'].isin(unique_stations)]
route_stations = route_stations.to_crs(epsg=26917)
#Create a dataframe with the start and end locations of most popular routes
nodes_start=route_bike_top10.merge(right=route_stations[['station_id','lat','lon','geometry']],
how='left',
left_on='start_station_id', right_on='station_id')
nodes_end=route_bike_top10.merge(right=route_stations[['station_id','lat','lon','geometry']],
how='left',
left_on='end_station_id', right_on='station_id')
nodes_start=nodes_start.drop(['end_station_id'], axis=1)
nodes_end=nodes_end.drop(['start_station_id'], axis=1)
nodes_start = gpd.GeoDataFrame(nodes_start, geometry='geometry')
nodes_start=nodes_start.to_crs(epsg=4326)
nodes_end = gpd.GeoDataFrame(nodes_end, geometry='geometry')
nodes_end=nodes_end.to_crs(epsg=4326)
#A function to set up the network graph based on address or coordinate
def create_graph(loc, dist, transport_mode, loc_type="address"):
"""Transport mode = ‘walk’, ‘bike’, ‘drive’, ‘drive_service’, ‘all’, ‘all_private’, ‘none’"""
if loc_type == "address":
G = ox.graph_from_address(loc, dist=dist, network_type=transport_mode)
elif loc_type == "points":
G = ox.graph_from_point(loc, dist=dist, network_type=transport_mode )
return G
#Set up the Network graph for City of Toronto using the OSMNX plugin
G = create_graph(loc=(43.6426, -79.3871), dist=5000, transport_mode='bike', loc_type="points")
ox.plot_graph(G)
G = ox.add_edge_speeds(G) #Impute
G = ox.add_edge_travel_times(G) #Travel time
nodes_proj = ox.graph_to_gdfs(G, nodes=True)
#Functions defined to identify the shortest path between the start and end stations
def line_dataframe(route):
    """Derive nodes, coordinates, lengths and travel times from a route on the graph network.
    We create lists to hold these values and loop through consecutive node pairs of the route.
    The required input is a route determined using nx.shortest_path."""
node_start = []
node_end = []
X_to = []
Y_to = []
X_from = []
Y_from = []
length = []
travel_time = []
for u, v in zip(route[:-1], route[1:]):
node_start.append(u)
node_end.append(v)
length.append(round(G.edges[(u, v, 0)]['length']))
travel_time.append(round(G.edges[(u, v, 0)]['travel_time']))
X_from.append(G.nodes[u]['x'])
Y_from.append(G.nodes[u]['y'])
X_to.append(G.nodes[v]['x'])
Y_to.append(G.nodes[v]['y'])
#Create a data frame out of the lists from the above calculations.
#We end up with a data frame that holds these values, like origin coordinates of each node in the route,
#length of the path between nodes and travel time between the nodes.
df = pd.DataFrame(list(zip(node_start, node_end, X_from, Y_from, X_to, Y_to, length, travel_time)),
columns =['node_start', 'node_end', 'X_from', 'Y_from', 'X_to', 'Y_to', 'length', 'travel_time'])
return df
def create_line_gdf(df):
    """Take the output from the line_dataframe function and create a LineString
    GeoDataFrame that connects all of the node coordinates."""
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.X_from, df.Y_from))
gdf['geometry_to'] = [Point(xy) for xy in zip(gdf.X_to, gdf.Y_to)]
gdf['line'] = gdf.apply(lambda row: LineString([row['geometry_to'], row['geometry']]), axis=1)
line_gdf = gdf[['node_start','node_end','length','travel_time', 'line']].set_geometry('line')
    line_gdf.crs = 'EPSG:4326'
return line_gdf
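The consecutive-node-pair loop inside line_dataframe can be sketched on a tiny made-up graph: the `zip(route[:-1], route[1:])` idiom pairs each node with its successor, and key `0` selects the first parallel edge, as OSMnx graphs are MultiDiGraphs. All node coordinates and edge attributes below are invented:

```python
import networkx as nx

# Tiny made-up MultiDiGraph mimicking an OSMnx graph: nodes carry x/y,
# edges carry length (m) and travel_time (s)
G_mini = nx.MultiDiGraph()
G_mini.add_node(1, x=-79.38, y=43.64)
G_mini.add_node(2, x=-79.37, y=43.65)
G_mini.add_node(3, x=-79.36, y=43.66)
G_mini.add_edge(1, 2, length=500, travel_time=120)
G_mini.add_edge(2, 3, length=700, travel_time=168)

route = [1, 2, 3]
rows = []
for u, v in zip(route[:-1], route[1:]):
    # key 0 selects the first parallel edge, as in line_dataframe
    edge = G_mini.edges[(u, v, 0)]
    rows.append((u, v, edge["length"], edge["travel_time"]))
print(rows)  # [(1, 2, 500, 120), (2, 3, 700, 168)]
```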
# Create a map of Toronto with bike stations and bike lanes
map_1 = folium.Map(location=[43.6426, -79.3871],
tiles='cartodbpositron',
zoom_start=12)
#Display the bike lanes in Toronto
for idx, row in bike_lanes.to_crs(epsg=4326).iterrows():
folium.Choropleth(row.geometry,line_weight=1,line_color='red').add_to(map_1)
#Display the start and end station for Top 10 most popular routes
for idx, row in route_stations.to_crs(epsg=4326).iterrows():
    folium.Marker([row.geometry.y, row.geometry.x], popup=folium.Popup(str(row.station_id), parse_html=True)).add_to(map_1)
#Create empty series for data collection
combined_lines=pd.Series(dtype=object)
combined_lines_buffer=gpd.GeoSeries()
#for the Top 10 most popular routes, determine the shortest path between the origin and destination
for x in range(len(nodes_start)):
G = create_graph(loc=(nodes_start.geometry.y[x], nodes_start.geometry.x[x]),
dist=3500, transport_mode='bike', loc_type="points")
G = ox.add_edge_speeds(G) #Impute
G = ox.add_edge_travel_times(G) #Travel time
nodes_proj = ox.graph_to_gdfs(G, nodes=True)
orig_xy = (nodes_start.geometry.y[x], nodes_start.geometry.x[x])
target_xy = (nodes_end.geometry.y[x], nodes_end.geometry.x[x])
orig_node = ox.get_nearest_node(G, orig_xy)
target_node = ox.get_nearest_node(G, target_xy)
# Calculate the shortest path
route = nx.shortest_path(G, orig_node, target_node, weight='length')
data=line_dataframe(route)
line_gdf=create_line_gdf(data)
line_gdf2=line_gdf.geometry.unary_union
#for the route near Cherry Beach, apply a larger buffer to capture nearest bike lane
if x in [3,7]:
combined_lines_buffer=pd.concat([combined_lines_buffer, line_gdf.to_crs(epsg=26917).buffer(1050)],axis=0)
combined_lines=pd.concat([combined_lines,line_gdf.line],axis=0)
else:
combined_lines_buffer=pd.concat([combined_lines_buffer, line_gdf.to_crs(epsg=26917).buffer(400)],axis=0)
combined_lines=pd.concat([combined_lines,line_gdf.line],axis=0)
#Convert the series into a GeoSeries (the lines were built in EPSG 4326) and project to EPSG 26917
combined_lines=gpd.GeoSeries(combined_lines, crs='EPSG:4326')
combined_lines=combined_lines.to_crs(epsg=26917)
#Convert the shortest path buffer (built in EPSG 26917) to EPSG 4326
combined_lines_buffer=gpd.GeoSeries(combined_lines_buffer, crs='EPSG:26917')
combined_lines_buffer=combined_lines_buffer.to_crs(epsg=4326)
#Collapse all of the buffer polygons into one geometry
combined_lines_union=combined_lines_buffer.unary_union
#Display the shortest path with buffer on map
folium.GeoJson(combined_lines.to_crs(epsg=4326)).add_to(map_1)
#Display map1
map_1
Based on our observation of the shortest paths for the top 10 most popular routes (shown in blue on the map above), we have identified the following bike lanes as the most popular among bike share users, assuming that they always take a bike lane along their routes.
The most popular bike lanes were identified using a buffer around the shortest routes. The bike lane on Commissioner Street lies outside this 400 m buffer zone, but has been included in the list because of its proximity to the bike stations near Cherry Beach.
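The buffer-and-contains test used here can be sketched with shapely alone. The coordinates below are arbitrary and in metres (i.e. assuming a projected CRS such as EPSG:26917):

```python
from shapely.geometry import LineString

# Arbitrary coordinates in metres (a projected CRS is assumed)
route = LineString([(0, 0), (1000, 0)])           # a shortest-path route
lane_near = LineString([(200, 150), (800, 150)])  # bike lane 150 m away
lane_far = LineString([(200, 900), (800, 900)])   # bike lane 900 m away

# A 400 m buffer around the route captures nearby lanes only
zone = route.buffer(400)
print(zone.contains(lane_near), zone.contains(lane_far))  # True False
```

This is why the Commissioner Street lane needed the larger 1050 m buffer: it sits outside the default 400 m zone.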
bike_lanes['station_proximity'] = bike_lanes.to_crs(epsg=4326).apply(lambda row: combined_lines_union.contains(row.geometry), axis=1)
top_bike_lanes = bike_lanes[bike_lanes['station_proximity']==True]
top_bike_lanes.head()
| | name | route_type | length | geometry | station_proximity |
|---|---|---|---|---|---|
| 2713 | LAKE SHORE BLVD W | bike lane | 102.39946 | LINESTRING (622464.401 4830980.009, 622459.816... | True |
| 2715 | LAKE SHORE BLVD W | bike lane | 62.43065 | LINESTRING (622416.775 4830889.895, 622402.528... | True |
| 2717 | LAKE SHORE BLVD W | bike lane | 61.93056 | LINESTRING (622375.805 4830843.152, 622362.739... | True |
| 3869 | SHERBOURNE ST | bike lane | 106.01093 | LINESTRING (630974.809 4836364.824, 630940.816... | True |
| 3897 | SHERBOURNE ST | bike lane | 52.18063 | LINESTRING (630991.721 4836315.464, 630974.809... | True |
# Create a map of Toronto with bike stations and bike lanes
map_1 = folium.Map(location=[43.6426, -79.3871],
tiles='cartodbpositron',
zoom_start=12)
#Display the bike lanes in Toronto
for idx, row in top_bike_lanes.to_crs(epsg=4326).iterrows():
folium.Choropleth(row.geometry,line_weight=1,line_color='red').add_to(map_1)
#Display the start and end station for Top 10 most popular routes
for idx, row in route_stations.to_crs(epsg=4326).iterrows():
    folium.Marker([row.geometry.y, row.geometry.x], popup=folium.Popup(str(row.station_id), parse_html=True)).add_to(map_1)
# Display map
map_1
#Subset bike trip data to contain only the Top 10 most popular routes
m1=pd.DataFrame()
for x in range(10):
m2=df_trips_data[(df_trips_data['start_station_id']==route_bike_top10['start_station_id'][x]) & (df_trips_data['end_station_id']==route_bike_top10['end_station_id'][x])]
m1=pd.concat([m1,m2],axis=0)
print('Number of trips contained in data set:',len(m1))
m1.head()
Number of trips contained in data set: 41715
| | subscription_id | trip_duration | start_station_id | start_time | start_station_name | end_station_id | end_time | end_station_name | bike_id | user_type | ... | wind_chill | weather | weekday | hour | is_holiday | isweek_day | category | weather2 | trip_dur_min | bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trip_id | | | | | | | | | | | | | | | | | | | | | |
| 712614 | NaN | 220 | 7059 | 2017-01-01 14:03:00-05:00 | Front St W / Blue Jays Way | 7033 | 2017-01-01 14:06:00-05:00 | Union Station | NaN | annual member | ... | NaN | clear_day | Sunday | 14.050000 | True | False | Clear day | Clear | 3.666667 | 50 to 55 |
| 713457 | NaN | 276 | 7059 | 2017-01-02 15:50:00-05:00 | Front St W / Blue Jays Way | 7033 | 2017-01-02 15:55:00-05:00 | Union Station | NaN | annual member | ... | NaN | clear_day | Monday | 15.833333 | True | True | Clear day | Clear | 4.600000 | 75 to 80 |
| 714219 | NaN | 309 | 7059 | 2017-01-03 09:39:00-05:00 | Front St W / Blue Jays Way | 7033 | 2017-01-03 09:44:00-05:00 | Union Station | NaN | annual member | ... | NaN | Rain | Tuesday | 9.650000 | False | True | Rain | Precipitation | 5.150000 | NaN |
| 714291 | NaN | 244 | 7059 | 2017-01-03 10:01:00-05:00 | Front St W / Blue Jays Way | 7033 | 2017-01-03 10:06:00-05:00 | Union Station | NaN | annual member | ... | NaN | Rain | Tuesday | 10.016667 | False | True | Rain | Precipitation | 4.066667 | NaN |
| 715168 | NaN | 203 | 7059 | 2017-01-04 08:02:00-05:00 | Front St W / Blue Jays Way | 7033 | 2017-01-04 08:06:00-05:00 | Union Station | NaN | annual member | ... | NaN | clear_day | Wednesday | 8.033333 | False | True | Clear day | Clear | 3.383333 | 75 to 80 |
5 rows × 39 columns
#Create a day-of-week column and a boolean column identifying weekdays vs weekends
m1['dayofweek'] = m1.start_time.dt.dayofweek
m1['isweek_day'] = m1.start_time.dt.weekday <5
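The weekday flag works because `dt.weekday` returns 0 for Monday through 6 for Sunday, so `< 5` marks Monday to Friday. A quick check on two hand-picked dates (chosen arbitrarily for illustration — a Friday and a Saturday):

```python
import pandas as pd

# Arbitrary sample timestamps: 2020-07-03 was a Friday, 2020-07-04 a Saturday
s = pd.Series(pd.to_datetime(["2020-07-03 08:00", "2020-07-04 08:00"]))
print(s.dt.weekday.tolist())        # [4, 5]
print((s.dt.weekday < 5).tolist())  # [True, False]
```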
Figure 47 - Hourly Distribution of Ride Counts on Popular Bike Routes shows the hourly distribution of rides along the most popular bike routes. From this we can infer when the popular bike lanes identified above will be most congested. On weekdays the busiest period is between 5 PM and 8 PM; on weekends it is between 2 PM and 8 PM. One can expect the bike lanes along these routes to be most congested during these time periods.
fig1 = plt.figure(figsize=(10,5))
ax2 = fig1.subplots()
ax2.set_xlim(0,24)
#Plot kernel density estimates of ride start hour for weekdays and weekends
sns.distplot(m1[m1['isweek_day']==True].hour, hist = False, label="Weekday")
sns.distplot(m1[m1['isweek_day']==False].hour, hist = False, label="Weekend")
#Format titles
ax2.set_title("Figure 47 - Hourly Distribution of Ride Counts on Popular Bike Routes",fontsize = 18)
ax2.set_xlabel("Hour of Day", fontsize = 16)
ax2.set_ylabel("Probability Density", fontsize = 16)
ax2.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax2.legend()
plt.show()
Figure 48 - Average Monthly Ride Counts between 2017 to 2020 on Top 10 Most Popular Bike Routes shows the ride count for each month of the year, averaged over the past four years. The most popular months for biking are between June and September; for this reason, one can anticipate that congestion on the bike lanes will be most likely during these summer months.
#Groupby function to find the monthly ride count for the Top 10 most popular routes
monthly_rides_m = m1.groupby(pd.Grouper(key="start_time",freq='M')).agg(ride_count=("user_type","count"))
monthly_rides_m['month']=monthly_rides_m.index.month
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Use lineplot to plot the average monthly ride count by month of year (2017 - 2020)
ax=sns.lineplot(x=monthly_rides_m.month, y=monthly_rides_m.ride_count)
ax.axes.set_title("Figure 48 - Average Monthly Ride Counts between 2017 to 2020 \n on Top 10 Most Popular Bike Routes",
                 fontsize=16)
ax.set_ylabel("Average Monthly Ride Counts")
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_xlabel("Month")
plt.show()
To investigate whether there are seasonal trends in trip duration, we first computed the daily mean trip duration and plotted it as a time series, shown in Figure 49 - Daily Mean Trip Duration from 2017 to 2020. The graph shows a seasonal trend in trip duration, increasing over the summer months and decreasing in the winter months. It is interesting to note that the peak is higher in 2020; we speculate that this was at least partially due to the COVID-19 pandemic, as more people began to use the bikes for recreational purposes.
#Groupby function to find the daily average trip duration
daily_rides = df_trips_data.groupby(pd.Grouper(key="start_time",freq='D')).agg(trip_dur=("trip_duration","mean"),
                                                                              trip_dur_max=("trip_duration","max"),
                                                                              trip_dur_min=("trip_duration","min"),
                                                                              temp=("temp_c",'max'))
daily_rides.head()
| | trip_dur | trip_dur_max | trip_dur_min | temp |
|---|---|---|---|---|
| start_time | | | | |
| 2017-01-01 00:00:00-05:00 | 694.282158 | 2071 | 79 | 3.0 |
| 2017-01-02 00:00:00-05:00 | 631.319613 | 1996 | 88 | 4.7 |
| 2017-01-03 00:00:00-05:00 | 656.013777 | 2041 | 60 | 5.1 |
| 2017-01-04 00:00:00-05:00 | 622.412903 | 1943 | 70 | 3.9 |
| 2017-01-05 00:00:00-05:00 | 612.165289 | 2170 | 62 | -5.6 |
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Use lineplot to plot the daily avg trip duration from 2017 to 2020
ax=sns.lineplot(x=daily_rides.index, y=daily_rides.trip_dur/60)
ax.axes.set_title("Figure 49 - Daily Mean Trip Duration from 2017 to 2020")
#Format Plot
ax.set_ylabel("Average Daily Trip Duration (min)")
ax.set_xlabel("Date")
#Format axes
# Minor ticks every month, labelled with the first letter of the month
fmt_month = mdates.MonthLocator(interval=1)
month_fmt = DateFormatter('%b')
def m_fmt(x, pos=None):
    return month_fmt(x)[0]
ax.xaxis.set_minor_locator(fmt_month)
ax.xaxis.set_minor_formatter(FuncFormatter(m_fmt))
# Major ticks every year
yearsFmt = mdates.DateFormatter('\n\n%Y')  # add some space for the year label
ax.xaxis.set_major_formatter(yearsFmt)
plt.show()
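The pd.Grouper resampling used above can be illustrated on a toy trip log (durations below are invented): `freq='D'` buckets rows by calendar day before aggregating, and the named-aggregation syntax assigns each output column its own (column, function) pair.

```python
import pandas as pd

# Toy trip log; durations (seconds) are invented for illustration
trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2017-01-01 08:00", "2017-01-01 17:30",
                                  "2017-01-02 09:15"]),
    "trip_duration": [600, 900, 300],
})

# Daily mean/max/min trip duration via pd.Grouper
daily = trips.groupby(pd.Grouper(key="start_time", freq="D")).agg(
    trip_dur=("trip_duration", "mean"),
    trip_dur_max=("trip_duration", "max"),
    trip_dur_min=("trip_duration", "min"),
)
print(daily)  # day 1: mean 750, max 900, min 600; day 2: all 300
```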
As Figure 50 - Monthly Mean Trip Duration from 2017 to 2020 shows, the average trip duration typically reaches a peak between June and August, at the height of summer, and begins to decrease once the weather gets colder. The average trip duration was exceptionally high in 2020.
#Groupby function to find the monthly average trip duration
month_rides = df_trips_data.groupby(pd.Grouper(key="start_time",freq='M')).agg(trip_dur=("trip_duration","mean"),
                                                                              trip_dur_max=("trip_duration","max"),
                                                                              trip_dur_min=("trip_duration","min"))
month_rides['year']=month_rides.index.year
month_rides['month']=month_rides.index.month
month_rides.head()
| | trip_dur | trip_dur_max | trip_dur_min | year | month |
|---|---|---|---|---|---|
| start_time | | | | | |
| 2017-01-31 00:00:00-05:00 | 622.950143 | 2185 | 60 | 2017 | 1 |
| 2017-02-28 00:00:00-05:00 | 646.689957 | 2185 | 60 | 2017 | 2 |
| 2017-03-31 00:00:00-04:00 | 625.167808 | 2185 | 60 | 2017 | 3 |
| 2017-04-30 00:00:00-04:00 | 714.605340 | 2185 | 60 | 2017 | 4 |
| 2017-05-31 00:00:00-04:00 | 725.172225 | 2185 | 60 | 2017 | 5 |
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5)
#Use lineplot to plot the monthly mean trip duration by month for each year (2017 - 2020)
ax=sns.lineplot(x=month_rides.month, y=month_rides.trip_dur/60, hue=month_rides.year)
ax.axes.set_title("Figure 50 - Monthly Mean Trip Duration from 2017 to 2020")
#Format Plot
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.set_ylabel("Trip Duration (min)")
ax.set_xlabel("Month")
plt.show()
Looking at the relationship between the maximum daily temperature and the mean daily trip duration, shown in Figure 51 - Correlation between Temperature and Trip Duration, there is a positive correlation between the two parameters. The variation in trip duration also appears to increase as the temperature increases.
#Set up Plot
plt.figure(figsize=(10,5))
sns.set(font_scale=1.2)
#Scatter plot of mean daily trip duration against max daily temperature
ride_scatter=sns.scatterplot(x=daily_rides.temp,
                             y=daily_rides.trip_dur/60)
ride_scatter.axes.set_title("Figure 51 - Correlation between Temperature and Trip Duration",
                            fontsize=16)
ride_scatter.set_xlabel("Max Daily Temperature (Celsius)",
                        fontsize=16)
ride_scatter.set_ylabel("Average Trip Duration (min)")
plt.show()
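The strength of the relationship seen in the scatter plot could be quantified with a Pearson correlation coefficient. A minimal sketch on invented data (not the actual daily_rides values):

```python
import pandas as pd

# Invented daily values for illustration only
toy = pd.DataFrame({
    "temp": [-5, 0, 10, 20, 28],               # max daily temperature (C)
    "trip_dur": [9.5, 10.0, 11.5, 13.0, 14.5], # mean trip duration (min)
})

# Pearson correlation between temperature and trip duration;
# on the real data this would be daily_rides['temp'].corr(daily_rides['trip_dur'] / 60)
r = toy["temp"].corr(toy["trip_dur"])
print(round(r, 3))
```

A value of r near +1 would confirm the strong positive relationship suggested by the scatter plot.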
This concludes our exploratory data analysis for the Toronto Bike Share Data.